[00:00:39] (03Abandoned) 10Dzahn: vrts: replace OTRS string in exim4 config tempate [puppet] - 10https://gerrit.wikimedia.org/r/932322 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [00:18:50] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:28:15] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [00:31:02] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/935869 [00:38:22] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/935869 (owner: 10TrainBranchBot) [00:50:27] (03CR) 10Dzahn: "maybe this could be deployed before or during the contint replacement?" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [00:56:43] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/935869 (owner: 10TrainBranchBot) [01:15:19] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cinder-backups: consolidate backup jobs on one host" [puppet] - 10https://gerrit.wikimedia.org/r/936020 (owner: 10Andrew Bogott) [01:18:02] (03CR) 10Dzahn: "I merged your change to add the tag, then restarted the build and publish jobs. 
Don't see it yet on https://docker-registry.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/932337 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [01:18:20] (03CR) 10Dzahn: miscweb: remove static_tendril classes and files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932337 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [01:18:25] (03CR) 10Dzahn: "I merged your change to add the tag, then restarted the build and publish jobs. Don't see it yet on https://docker-registry.wikimedia.org/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [01:38:20] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:50:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:51:29] (03PS4) 10Stevemunene: analytics: remove puppet references for analytics[1058-1069] [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) [01:53:41] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42330/console" [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [01:55:32] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [01:55:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:08:20] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:11] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:13:18] (03CR) 10Stevemunene: [V: 03+1] analytics: remove puppet references for analytics[1058-1069] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [02:28:20] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:27] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:38:28] (03PS1) 10Andrew Bogott: wmcs-cinder-backups: increase per-backup timeout [puppet] - 10https://gerrit.wikimedia.org/r/936128 [02:57:51] (03PS1) 10Andrew Bogott: cinder backups: increase chunked backup file size [puppet] - 10https://gerrit.wikimedia.org/r/936129 [02:58:12] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-backups: increase per-backup timeout [puppet] - 10https://gerrit.wikimedia.org/r/936128 (owner: 10Andrew Bogott) [02:58:31] (03CR) 10Andrew Bogott: [C: 03+2] cinder backups: increase chunked backup file size [puppet] - 
10https://gerrit.wikimedia.org/r/936129 (owner: 10Andrew Bogott) [03:15:37] are the Phabricator issues known? [03:20:38] filed as T341311, in any case [03:20:39] T341311: Sporadic MySQL connection errors in Phabricator - https://phabricator.wikimedia.org/T341311 [03:35:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:40:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:55:01] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:02:50] tgr_: I think they've gotten worse. 
I keep trying to add to your ticket but am unable, phab is too unstable [04:03:29] yeah, seems to be getting more frequent [04:05:01] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:05:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:08:13] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from phabricator.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=ulsfo%20prometheus/ops&var-cluster=text&var-origin=phabricator.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:10:01] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:10:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:15:01] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:19:34] Sigh. I ack'ed the page. Looking [04:19:50] o/ [04:20:01] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:25:01] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:38:13] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from phabricator.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=ulsfo%20prometheus/ops&var-cluster=text&var-origin=phabricator.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:40:01] (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:47:15] 10SRE, 10DBA, 10Phabricator: Sporadic MySQL connection errors in Phabricator - https://phabricator.wikimedia.org/T341311 (10Novem_Linguae) [05:01:43] 10SRE, 10DBA, 10Phabricator: Sporadic MySQL connection errors in Phabricator - 
https://phabricator.wikimedia.org/T341311 (10colewhite) 05Open→03Resolved a:03colewhite The extra load on Phabricator has been removed. Going to optimistically resolve this, but please reopen if it comes back. [05:38:21] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:44:58] 10SRE, 10DBA, 10Phabricator: Sporadic MySQL connection errors in Phabricator - https://phabricator.wikimedia.org/T341311 (10Aklapper) > The extra load on Phabricator has been removed. @colewhite: Are there any more details about actions taken that could be shared in public please, if available? Thanks! [05:45:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::builder: allow using bookworm as a base image [puppet] - 10https://gerrit.wikimedia.org/r/935686 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [06:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230707T0600) [06:28:20] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:32:47] (03Restored) 10Giuseppe Lavagetto: Be strict on undefined variables such as seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [06:33:03] (03PS2) 10Giuseppe Lavagetto: Remove seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/754544 [06:33:05] (03PS5) 10Giuseppe Lavagetto: Be strict on undefined variables in templates [docker-images/docker-pkg] - 
10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [06:33:07] (03PS1) 10Giuseppe Lavagetto: tox: add python 3.11 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/936139 [06:33:09] (03PS1) 10Giuseppe Lavagetto: Add boundary to python-requests so we don't switch to urllib3 2.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/936140 [06:33:32] (03CR) 10CI reject: [V: 04-1] Add boundary to python-requests so we don't switch to urllib3 2.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/936140 (owner: 10Giuseppe Lavagetto) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230707T0700) [07:02:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:12:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:30:10] (03CR) 10JMeybohm: [C: 04-1] deployment_server: add REPL for mw-debug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [07:30:54] (03PS1) 10Slyngshede: Function:htpasswd Avoid initializing variable multiple times. 
[puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) [07:31:16] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] rsyslog::kubernetes: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [07:31:19] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::node: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936043 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [07:31:21] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::master: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936041 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [07:31:23] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] calico::kubernetes: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936040 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [07:31:32] (03CR) 10CI reject: [V: 04-1] Function:htpasswd Avoid initializing variable multiple times. [puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) (owner: 10Slyngshede) [07:33:46] (03PS2) 10Slyngshede: Function:htpasswd Avoid initializing variable multiple times. [puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) [07:34:20] (03CR) 10CI reject: [V: 04-1] Function:htpasswd Avoid initializing variable multiple times. 
[puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) (owner: 10Slyngshede) [07:34:27] (03PS2) 10JMeybohm: Add mesh.configuration 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935679 (https://phabricator.wikimedia.org/T300324) [07:34:29] (03PS2) 10JMeybohm: mesh.configuration: Refactor max_requests_per_connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/935680 (https://phabricator.wikimedia.org/T304124) [07:34:31] (03PS2) 10JMeybohm: mesh.configuration: Remove tls_minimum_protocol_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935684 (https://phabricator.wikimedia.org/T337453) [07:34:33] (03PS6) 10JMeybohm: mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) [07:34:35] (03PS3) 10JMeybohm: mesh.configuration: Update all charts t 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935754 (https://phabricator.wikimedia.org/T300324) [07:34:38] (03CR) 10Filippo Giunchedi: Add monitoring for mirrors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [07:36:19] (03PS4) 10JMeybohm: mesh.configuration: Update all charts to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935754 (https://phabricator.wikimedia.org/T300324) [07:36:29] (03PS3) 10Slyngshede: Function:htpasswd Avoid initializing variable multiple times. [puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) [07:37:07] (03CR) 10Filippo Giunchedi: [C: 03+1] sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [07:37:19] (03CR) 10CI reject: [V: 04-1] Function:htpasswd Avoid initializing variable multiple times. 
[puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) (owner: 10Slyngshede) [07:37:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/754544 (owner: 10Giuseppe Lavagetto) [07:39:29] (03PS4) 10Slyngshede: Function:htpasswd Avoid initializing variable multiple times. [puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) [07:40:07] (03Merged) 10jenkins-bot: Remove seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/754544 (owner: 10Giuseppe Lavagetto) [07:40:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Be strict on undefined variables in templates [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [07:42:34] (03Merged) 10jenkins-bot: Be strict on undefined variables in templates [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [07:43:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/936003 (https://phabricator.wikimedia.org/T341045) (owner: 10ArielGlenn) [07:45:15] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42331/console" [puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) (owner: 10Slyngshede) [07:48:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] tox: add python 3.11 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/936139 (owner: 10Giuseppe Lavagetto) [07:49:10] (03CR) 10Slyngshede: "Any helpful hints to how I test that this doesn't break anything?" 
[puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) (owner: 10Slyngshede) [07:50:57] (03Merged) 10jenkins-bot: tox: add python 3.11 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/936139 (owner: 10Giuseppe Lavagetto) [07:56:22] (03PS1) 10Muehlenhoff: Remove jgreen from ops group [puppet] - 10https://gerrit.wikimedia.org/r/936215 (https://phabricator.wikimedia.org/T336231) [07:57:00] (03CR) 10Giuseppe Lavagetto: "recheck" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/936140 (owner: 10Giuseppe Lavagetto) [08:00:08] (03CR) 10Giuseppe Lavagetto: deployment_server: add REPL for mw-debug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [08:00:29] (03PS4) 10Giuseppe Lavagetto: deployment_server: add REPL for mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) [08:04:36] (03CR) 10JMeybohm: [C: 03+1] deployment_server: add REPL for mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [08:05:39] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-test[1006-1010].eqiad.wmnet with reason: resetting cluster [08:05:43] (03PS1) 10Filippo Giunchedi: Remove misleading comments re: local testing [alerts] - 10https://gerrit.wikimedia.org/r/936220 [08:05:49] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [08:05:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-test[1006-1010].eqiad.wmnet with reason: resetting cluster [08:06:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:11:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:15:46] (03CR) 10JMeybohm: [C: 03+2] Add mesh.configuration 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935679 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [08:15:51] (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Refactor max_requests_per_connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/935680 (https://phabricator.wikimedia.org/T304124) (owner: 10JMeybohm) [08:15:57] (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Remove tls_minimum_protocol_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935684 (https://phabricator.wikimedia.org/T337453) (owner: 10JMeybohm) [08:16:01] (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [08:16:16] (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Update all charts to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935754 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [08:16:19] (03Merged) 10jenkins-bot: Add mesh.configuration 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935679 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [08:16:22] (03Merged) 10jenkins-bot: mesh.configuration: Refactor max_requests_per_connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/935680 (https://phabricator.wikimedia.org/T304124) (owner: 10JMeybohm) [08:16:26] (03Merged) 10jenkins-bot: mesh.configuration: Remove 
tls_minimum_protocol_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935684 (https://phabricator.wikimedia.org/T337453) (owner: 10JMeybohm) [08:16:31] (03Merged) 10jenkins-bot: mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [08:20:18] (03Merged) 10jenkins-bot: mesh.configuration: Update all charts to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935754 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [08:34:01] (03CR) 10Jbond: [C: 03+1] "lgtm assuming things are cleared up" [puppet] - 10https://gerrit.wikimedia.org/r/936215 (https://phabricator.wikimedia.org/T336231) (owner: 10Muehlenhoff) [08:37:01] (03PS5) 10Jbond: pybal: update check to conform to the nagios plugin api [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) [08:38:36] (03CR) 10Jbond: [V: 03+1] do not merger: example of phabricator pcc run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895722 (owner: 10Jbond) [08:38:48] (03Abandoned) 10Jbond: do not merger: example of phabricator pcc run [puppet] - 10https://gerrit.wikimedia.org/r/895722 (owner: 10Jbond) [08:43:27] (03PS1) 10Btullis: Bump datahub image and deploy standalone MAE/MCE consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/936228 (https://phabricator.wikimedia.org/T329514) [08:46:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10ItamarWMDE) Sounds like a good idea to me, thank you for the suggestion @Dzahn [08:46:56] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: xhgui1002.eqiad.wmnet [08:46:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: xhgui1002.eqiad.wmnet [08:47:09] !log jmm@cumin2002 START - Cookbook 
sre.debmonitor.remove-hosts for 1 hosts: xhgui2002.codfw.wmnet [08:47:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: xhgui2002.codfw.wmnet [08:48:26] (03CR) 10Btullis: [C: 03+2] Bump datahub image and deploy standalone MAE/MCE consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/936228 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [08:48:27] !log installing bookworm kernel updates [08:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:15] (03Merged) 10jenkins-bot: Bump datahub image and deploy standalone MAE/MCE consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/936228 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [08:49:32] (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [08:50:17] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [08:52:14] (03PS4) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) [08:52:16] (03CR) 10David Caro: replica_cnf_api: refactor to use multiple backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [08:52:18] (03PS1) 10David Caro: wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 [08:52:20] (03PS1) 10David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [08:53:59] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [08:56:10] (03PS1) 10Slyngshede: data: extend nickifeajika until June 30th 2024. 
[puppet] - 10https://gerrit.wikimedia.org/r/936233 [08:56:13] (03CR) 10CI reject: [V: 04-1] WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [08:56:18] (03CR) 10CI reject: [V: 04-1] wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro) [08:56:36] (03CR) 10David Caro: "Sorry for the refactor mid-flight." [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [08:57:21] (03CR) 10Filippo Giunchedi: [C: 03+2] Remove misleading comments re: local testing [alerts] - 10https://gerrit.wikimedia.org/r/936220 (owner: 10Filippo Giunchedi) [08:57:26] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [08:57:35] (03PS2) 10Slyngshede: data: extend nickifeajika until June 30th 2024. [puppet] - 10https://gerrit.wikimedia.org/r/936233 [08:59:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/936233 (owner: 10Slyngshede) [09:00:09] (03CR) 10Slyngshede: [C: 03+2] data: extend nickifeajika until June 30th 2024. 
[puppet] - 10https://gerrit.wikimedia.org/r/936233 (owner: 10Slyngshede) [09:02:00] (03CR) 10Arturo Borrero Gonzalez: Add a new nftables::service define (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:07:07] (03CR) 10Jbond: "change lgtm see inline for testing options" [puppet] - 10https://gerrit.wikimedia.org/r/936213 (https://phabricator.wikimedia.org/T228966) (owner: 10Slyngshede) [09:08:18] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: drop unused hiera overrides [puppet] - 10https://gerrit.wikimedia.org/r/936234 [09:10:36] (03CR) 10David Caro: "the echo_server should not be there (yet)" [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [09:12:19] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC noop: https://puppet-compiler.wmflabs.org/output/936234/42332/" [puppet] - 10https://gerrit.wikimedia.org/r/936234 (owner: 10Arturo Borrero Gonzalez) [09:12:39] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] analytics: remove puppet references for analytics[1058-1069] [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [09:12:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [09:13:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2003.codfw.wmnet [09:15:14] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: codfw: use someup check for haproxy BGP check [puppet] - 10https://gerrit.wikimedia.org/r/936235 (https://phabricator.wikimedia.org/T324992) [09:15:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: codfw: use someup check for haproxy BGP check [puppet] - 10https://gerrit.wikimedia.org/r/936235 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [09:17:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
people2003.codfw.wmnet [09:17:28] (03PS3) 10Arturo Borrero Gonzalez: cloudlb: eqiad: bootstrap hiera data [puppet] - 10https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200) [09:18:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [09:19:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3002.esams.wmnet [09:19:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1004.eqiad.wmnet [09:19:41] (03PS2) 10Filippo Giunchedi: Fix CirrusSearchJobQueueLagTooHigh to use histograms [alerts] - 10https://gerrit.wikimedia.org/r/936070 (owner: 10Alexandros Kosiaris) [09:20:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1004.eqiad.wmnet [09:21:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server: add REPL for mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [09:22:05] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: add openstack-next.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/936236 (https://phabricator.wikimedia.org/T341220) [09:23:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add boundary to python-requests so we don't switch to urllib3 2.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/936140 (owner: 10Giuseppe Lavagetto) [09:24:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lists1003.wikimedia.org [09:24:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3002.esams.wmnet [09:25:13] (03Merged) 10jenkins-bot: Add boundary to python-requests so we don't switch to urllib3 2.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/936140 (owner: 10Giuseppe Lavagetto) [09:26:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet 
[09:29:34] !log stevemunene@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [09:29:37] (03PS4) 10Arturo Borrero Gonzalez: cloudlb: eqiad: bootstrap hiera data [puppet] - 10https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200) [09:29:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [09:32:08] (03PS1) 10Jbond: puppetboard: add docs [puppet] - 10https://gerrit.wikimedia.org/r/936237 (https://phabricator.wikimedia.org/T341268) [09:33:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [09:34:57] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host lists1003.wikimedia.org [09:35:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet [09:37:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [09:37:40] (03PS1) 10Jbond: pupetboard::bookworm: Enable client authentication [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) [09:38:21] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:39:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet [09:39:19] (03CR) 
10Jelto: [C: 03+1] miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [09:41:01] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [09:42:23] (03PS5) 10Arturo Borrero Gonzalez: cloudlb: eqiad: bootstrap hiera data [puppet] - 10https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200) [09:42:29] (03CR) 10Jelto: "If possible I'd like to switch contint hosts first and deploy this once we verified everything still works." [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [09:43:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:45:35] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[09:46:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet [09:46:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42334/console" [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [09:46:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet [09:48:17] (03CR) 10Jbond: [C: 03+2] puppetboard: add docs [puppet] - 10https://gerrit.wikimedia.org/r/936237 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [09:51:43] (03PS1) 10Effie Mouzeli: ipoid: enable service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/936239 [09:52:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet [09:52:18] (03CR) 10CI reject: [V: 04-1] ipoid: enable service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/936239 (owner: 10Effie Mouzeli) [09:52:57] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host debmonitor2003.codfw.wmnet [09:55:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [09:55:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb2003.codfw.wmnet [09:57:52] (03PS2) 10Effie Mouzeli: ipoid: enable service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/936239 [09:59:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [10:00:09] (03CR) 10JMeybohm: [C: 03+1] ipoid: enable service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/936239 (owner: 10Effie Mouzeli) [10:00:22] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: enable service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/936239 (owner: 10Effie Mouzeli) [10:00:57] (03PS2) 10Jbond: pupetboard::bookworm: 
Enable client authentication [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) [10:00:59] (03PS1) 10Jbond: uwsgi::app: update docs [puppet] - 10https://gerrit.wikimedia.org/r/936240 (https://phabricator.wikimedia.org/T341268) [10:01:12] (03Merged) 10jenkins-bot: ipoid: enable service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/936239 (owner: 10Effie Mouzeli) [10:03:56] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [10:05:05] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [10:05:41] !log rebooting puppetserver2001 [10:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:54] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb2003.codfw.wmnet [10:09:28] !log rebooting puppetserver1001 [10:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:08] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: add openstack-next.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/936236 (https://phabricator.wikimedia.org/T341220) [10:13:03] (03CR) 10CI reject: [V: 04-1] wikimediacloud.org: add openstack-next.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/936236 (https://phabricator.wikimedia.org/T341220) (owner: 10Arturo Borrero Gonzalez) [10:13:12] !log rebooting puppetdb1003 [10:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:36] (03CR) 10Majavah: [C: 04-1] "224-27.56.15.185.in-addr.arpa and 240-29.56.15.185.in-addr.arpa already exist in the netbox repo, should we include those?" 
[dns] - 10https://gerrit.wikimedia.org/r/936236 (https://phabricator.wikimedia.org/T341220) (owner: 10Arturo Borrero Gonzalez) [10:16:11] (03PS3) 10Arturo Borrero Gonzalez: wikimediacloud.org: add openstack-next.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/936236 (https://phabricator.wikimedia.org/T341220) [10:19:46] (03CR) 10Clément Goubert: [C: 03+1] Fix CirrusSearchJobQueueLagTooHigh to use histograms [alerts] - 10https://gerrit.wikimedia.org/r/936070 (owner: 10Alexandros Kosiaris) [10:24:10] 10SRE, 10DBA, 10Phabricator, 10collaboration-services: Sporadic MySQL connection errors in Phabricator - https://phabricator.wikimedia.org/T341311 (10LSobanski) [10:28:20] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:32:15] (03PS2) 10Jbond: uwsgi::app: update docs [puppet] - 10https://gerrit.wikimedia.org/r/936240 (https://phabricator.wikimedia.org/T341268) [10:32:17] (03PS3) 10Jbond: pupetboard::bookworm: Enable client authentication [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) [10:32:20] (03PS1) 10Jbond: uwsgi::app: add ability to configure the systemd user and group [puppet] - 10https://gerrit.wikimedia.org/r/936243 (https://phabricator.wikimedia.org/T341268) [10:37:13] (03PS4) 10Jbond: pupetboard::bookworm: Enable client authentication [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) [10:38:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42337/console" [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [10:40:25] (03PS1) 10Arturo Borrero Gonzalez: templates: add 
56.15.185.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) [10:41:25] (03CR) 10CI reject: [V: 04-1] templates: add 56.15.185.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [10:41:58] (03CR) 10Jbond: [C: 03+2] "change is NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/936240 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [10:43:34] (03CR) 10Volans: "LGTM, I'd just check with PCC that is a noop with the current puppetboard" [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [10:43:38] (03CR) 10Volans: [C: 03+1] pupetboard::bookworm: Enable client authentication [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [10:44:07] (03CR) 10Cathal Mooney: "A few comments. I think we also need the delegation for 0-25.56.15.185.in-addr.arpa to designate, and the CNAMEs for 0..127 pointing to e" [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [10:45:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42339/console" [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [10:46:08] (03CR) 10Jbond: [V: 03+1] pupetboard::bookworm: Enable client authentication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [10:47:17] (03CR) 10Jbond: [C: 03+2] "change is noop: https://puppet-compiler.wmflabs.org/output/936243/42338/" [puppet] - 10https://gerrit.wikimedia.org/r/936243 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [10:47:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST 
pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:54] (03CR) 10Jbond: [V: 03+1 C: 03+2] pupetboard::bookworm: Enable client authentication [puppet] - 10https://gerrit.wikimedia.org/r/936238 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [10:48:29] (03PS2) 10Arturo Borrero Gonzalez: templates: add 56.15.185.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) [10:49:24] (03CR) 10CI reject: [V: 04-1] templates: add 56.15.185.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [10:52:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:52:41] PROBLEM - Check systemd state on puppetboard1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetboard.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:55] ^^ this is me [10:53:25] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: connect to address localhost and port 8001: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [10:54:21] ack [10:54:39] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: include netbox-generated records [dns] - 10https://gerrit.wikimedia.org/r/936247 (https://phabricator.wikimedia.org/T341338) [10:55:33] (03CR) 10CI reject: [V: 04-1] wikimediacloud.org: include netbox-generated records [dns] - 10https://gerrit.wikimedia.org/r/936247 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero 
Gonzalez) [10:57:20] (03PS3) 10Arturo Borrero Gonzalez: wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) [10:57:53] (03Abandoned) 10Arturo Borrero Gonzalez: wikimediacloud.org: include netbox-generated records [dns] - 10https://gerrit.wikimedia.org/r/936247 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [10:58:10] (03CR) 10CI reject: [V: 04-1] wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [10:58:22] (03CR) 10Arturo Borrero Gonzalez: wikimediacloud.org: add netbox includes (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [11:02:43] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:04:48] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikimediacloud - aborrero@cumin1001" [11:05:32] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikimediacloud - aborrero@cumin1001" [11:05:32] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:07:02] (03CR) 10Arturo Borrero Gonzalez: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [11:09:10] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42341/console" [puppet] - 10https://gerrit.wikimedia.org/r/936062 (https://phabricator.wikimedia.org/T258697) (owner: 10Alexandros Kosiaris) [11:11:35] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "PCC LGTM, merging" [puppet] - 
10https://gerrit.wikimedia.org/r/936062 (https://phabricator.wikimedia.org/T258697) (owner: 10Alexandros Kosiaris) [11:11:36] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) For switching the services via Puppet, that is nowadays done via single Hiera variable (introduced by [[ https://g... [11:12:43] (03PS1) 10Btullis: Update the datahub charts with new environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/936250 (https://phabricator.wikimedia.org/T329514) [11:13:38] 10SRE, 10observability, 10serviceops, 10Patch-For-Review: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10akosiaris) 05Open→03Resolved a:03akosiaris PCC at https://puppet-compiler.wmflabs.org/output/936062/42341/ says 0 diff for alert hosts, lvs... [11:14:33] (03PS4) 10Arturo Borrero Gonzalez: wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) [11:15:26] (03CR) 10CI reject: [V: 04-1] wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [11:17:00] (03PS2) 10Btullis: Update the datahub charts with new environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/936250 (https://phabricator.wikimedia.org/T329514) [11:17:05] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:18:07] (03Abandoned) 10Ladsgroup: ExternalLinks: Make order by and continue only rely on el_id in READ NEW [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/935857 (https://phabricator.wikimedia.org/T341000) (owner: 10Ladsgroup) [11:18:19] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 10788 bytes in 0.169 second response time 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [11:18:48] (03PS1) 10Jbond: puppetboard::bookworm: correct parameter [puppet] - 10https://gerrit.wikimedia.org/r/936251 (https://phabricator.wikimedia.org/T341268) [11:19:26] (03CR) 10Btullis: [C: 03+2] Update the datahub charts with new environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/936250 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:20:16] (03Merged) 10jenkins-bot: Update the datahub charts with new environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/936250 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:22:23] (03CR) 10Jbond: [C: 03+2] puppetboard::bookworm: correct parameter [puppet] - 10https://gerrit.wikimedia.org/r/936251 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [11:27:14] !log Stopped zuul-merger contint1002 [11:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:52] (03CR) 10Jbond: [C: 03+2] puppedb::bookworm: Force client auth [puppet] - 10https://gerrit.wikimedia.org/r/935863 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:31:01] PROBLEM - zuul_merger_service_running on contint1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [11:31:08] (03PS5) 10Arturo Borrero Gonzalez: wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) [11:32:28] (03CR) 10Gmodena: "This change is ready for review." 
(037 comments) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [11:34:23] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jbond) [11:34:49] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Puppetboard: configure client auth - https://phabricator.wikimedia.org/T341268 (10jbond) 05Open→03Resolved a:03jbond [11:37:09] RECOVERY - zuul_merger_service_running on contint1002 is OK: PROCS OK: 1 process with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [11:42:38] (03PS1) 10QChris: Add .gitreview [debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936256 [11:42:40] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936256 (owner: 10QChris) [11:42:40] !log Enabled zuul-merger contint1002, disabled it on contint2001 and marked that host as under maintenance in Icinga for the next two hours [11:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:55] (03PS1) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) [11:45:28] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:46:18] (03CR) 10Majavah: [C: 04-1] wikimediacloud.org: add netbox includes (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [11:47:06] (03CR) 10Cathal Mooney: "Few comments, overall looks good, there are a more few entries I think we can move to manage through netbox which probably simplifies thin" [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [11:47:32] 
!log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikimediacloud - aborrero@cumin1001" [11:47:40] (03CR) 10Muehlenhoff: "Thanks for the extensive feedback! I'll go through the comments later today and Monday and update those still unaddressed." [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:48:16] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikimediacloud - aborrero@cumin1001" [11:48:16] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:48:19] (03CR) 10Cathal Mooney: wikimediacloud.org: add netbox includes (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [11:48:21] (03PS6) 10Arturo Borrero Gonzalez: wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) [11:49:14] (03CR) 10CI reject: [V: 04-1] wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [11:49:59] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [11:50:47] (03PS7) 10Arturo Borrero Gonzalez: wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) [11:51:59] (03CR) 10Cathal Mooney: wikimediacloud.org: add netbox includes (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [11:52:13] 10SRE, 10Continuous-Integration-Infrastructure, 
10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) Thanks for testing and running rsync! I created a rough checklist in the task description. Feel free to edit if I... [11:52:51] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [11:55:39] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [11:56:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42342/console" [puppet] - 10https://gerrit.wikimedia.org/r/936076 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:56:27] (03PS2) 10Jbond: puppetboard: Add additional site to proxy puppet7 config [puppet] - 10https://gerrit.wikimedia.org/r/936076 (https://phabricator.wikimedia.org/T338811) [11:56:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42343/console" [puppet] - 10https://gerrit.wikimedia.org/r/936076 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:57:29] (03PS3) 10Jbond: puppetboard: Add additional site to proxy puppet7 config [puppet] - 10https://gerrit.wikimedia.org/r/936076 (https://phabricator.wikimedia.org/T338811) [11:57:51] RECOVERY - Check systemd state on puppetboard1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:25] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:59:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42344/console" [puppet] - 10https://gerrit.wikimedia.org/r/936076 
(https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:00:45] (03PS8) 10Arturo Borrero Gonzalez: wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) [12:01:22] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikimediacloud - aborrero@cumin1001" [12:02:03] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikimediacloud - aborrero@cumin1001" [12:02:03] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:04:42] (03CR) 10Filippo Giunchedi: "To keep the archives happy:" [alerts] - 10https://gerrit.wikimedia.org/r/936070 (owner: 10Alexandros Kosiaris) [12:05:06] (03PS1) 10Hashar: contint: move zuul-merger from contint2001 to contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/936266 (https://phabricator.wikimedia.org/T324659) [12:05:35] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/936266 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [12:15:12] 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater - https://phabricator.wikimedia.org/T337801 (10dcausse) a:03dcausse [12:17:21] !log Re-enabled zuul-merger on contint2001 and removed the Icinga maintenance window [12:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:51] (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/936266/2050/" [puppet] - 10https://gerrit.wikimedia.org/r/936266 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [12:31:47] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - 
https://phabricator.wikimedia.org/T324659 (10hashar) [12:32:03] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) >>! In T324659#8997167, @Jelto wrote: > Thanks for testing and running rsync! > > I created a rough checklist in... [12:37:35] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetboard: Add additional site to proxy puppet7 config [puppet] - 10https://gerrit.wikimedia.org/r/936076 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:39:23] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/936095 (https://phabricator.wikimedia.org/T341290) (owner: 10Bking) [12:46:55] (03PS1) 10Btullis: Configure datahub-gms not to wait for upgrade before starting [deployment-charts] - 10https://gerrit.wikimedia.org/r/936271 (https://phabricator.wikimedia.org/T329514) [12:47:03] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [12:48:14] (03CR) 10Btullis: [C: 03+2] Configure datahub-gms not to wait for upgrade before starting [deployment-charts] - 10https://gerrit.wikimedia.org/r/936271 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:49:02] (03Merged) 10jenkins-bot: Configure datahub-gms not to wait for upgrade before starting [deployment-charts] - 10https://gerrit.wikimedia.org/r/936271 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:49:23] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) Fun Friday finding: neither contint1002 (recently moved) nor contint2002 were allowed to ssh to the `integration`... 
[12:49:52] (03PS1) 10Btullis: Fix the path to the jaas configuration file for the datahub-frontend [deployment-charts] - 10https://gerrit.wikimedia.org/r/936272 (https://phabricator.wikimedia.org/T329514) [12:50:50] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [12:51:23] (03PS1) 10Jbond: puppetmaster: enable submit only from puppet5 to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) [12:55:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42345/console" [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:57:13] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10TomerLerner) @MSantos @akosiaris thanks for your help with this! We call /mobile-sections-lead on the server side and had 403 f... [12:57:25] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [12:58:34] (03PS2) 10Btullis: Fix the path to the jaas configuration file for the datahub-frontend [deployment-charts] - 10https://gerrit.wikimedia.org/r/936272 (https://phabricator.wikimedia.org/T329514) [13:00:14] (03CR) 10Btullis: [C: 03+2] Fix the path to the jaas configuration file for the datahub-frontend [deployment-charts] - 10https://gerrit.wikimedia.org/r/936272 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:01:05] (03Merged) 10jenkins-bot: Fix the path to the jaas configuration file for the datahub-frontend [deployment-charts] - 10https://gerrit.wikimedia.org/r/936272 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:19:47] (03CR) 10Muehlenhoff: "Initial comments, some pending." 
[puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:21:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Fine by me." [alerts] - 10https://gerrit.wikimedia.org/r/936070 (owner: 10Alexandros Kosiaris) [13:24:37] (03PS11) 10Muehlenhoff: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) [13:29:47] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/936266 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [13:33:39] (03CR) 10Ottomata: data-engineering: add alerts flink enrichment apps (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [13:33:44] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) Lets avoid using `MWOffliner` as it is a different API consumer and we wont be able to track the deprecation. 
[13:35:29] (03PS12) 10Muehlenhoff: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) [13:37:11] (03CR) 10Muehlenhoff: Add a new nftables::service define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:38:21] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:41:15] (03PS1) 10Ssingh: cumin: update dns-auth-canary for dns1004 [puppet] - 10https://gerrit.wikimedia.org/r/936278 [13:45:26] (03CR) 10Volans: "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/936278 (owner: 10Ssingh) [13:46:14] (03PS4) 10Gmodena: data-engineering: add alerts flink enrichment apps [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) [13:47:27] (03CR) 10Ssingh: [C: 03+2] cumin: update dns-auth-canary for dns1004 [puppet] - 10https://gerrit.wikimedia.org/r/936278 (owner: 10Ssingh) [13:48:41] (03PS1) 10Jbond: puppeserver: support submit_only urls [puppet] - 10https://gerrit.wikimedia.org/r/936279 (https://phabricator.wikimedia.org/T338811) [13:51:08] (03CR) 10CI reject: [V: 04-1] puppeserver: support submit_only urls [puppet] - 10https://gerrit.wikimedia.org/r/936279 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [13:51:59] (03PS5) 10Gmodena: 
data-engineering: add alerts flink enrichment apps [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) [13:53:44] (03PS2) 10Jbond: puppeserver: support submit_only urls [puppet] - 10https://gerrit.wikimedia.org/r/936279 (https://phabricator.wikimedia.org/T338811) [13:55:14] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341168 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm found mgmt cable not seated properly. reseated both ends and checked connection. can ssh into machine. [13:55:41] (03PS1) 10Giuseppe Lavagetto: mw-debug-repl: improve UX [puppet] - 10https://gerrit.wikimedia.org/r/936280 (https://phabricator.wikimedia.org/T341197) [13:57:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42348/console" [puppet] - 10https://gerrit.wikimedia.org/r/936279 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [13:58:05] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppeserver: support submit_only urls [puppet] - 10https://gerrit.wikimedia.org/r/936279 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [13:58:31] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [13:58:37] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s) [13:59:04] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [13:59:11] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 07s) [14:03:16] (03PS1) 10Jbond: puppetmaster: enable submit only from puppet7 to puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/936281 (https://phabricator.wikimedia.org/T338811) [14:04:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42349/console" [puppet] - 10https://gerrit.wikimedia.org/r/936281 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [14:05:34] (03PS2) 10Giuseppe 
Lavagetto: mw-debug-repl: improve UX [puppet] - 10https://gerrit.wikimedia.org/r/936280 (https://phabricator.wikimedia.org/T341197) [14:08:20] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [14:14:19] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:17] (03PS1) 10Jbond: Puppetserver: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/936283 [14:21:40] (03CR) 10Jbond: [C: 03+2] Puppetserver: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/936283 (owner: 10Jbond) [14:26:16] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [14:26:35] PROBLEM - Query Service HTTP Port on wdqs2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:26:47] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:28:21] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:28:58] ^ bking, guessing that's expected? [14:38:14] (03PS1) 10Jbond: puppetd::site: Add a way to open the firewall [puppet] - 10https://gerrit.wikimedia.org/r/936284 (https://phabricator.wikimedia.org/T338811) [14:38:16] (03PS1) 10Jbond: puppetdb: add allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/936285 (https://phabricator.wikimedia.org/T338811) [14:39:01] (03CR) 10Clément Goubert: [C: 03+1] "lgtm, extremely minor nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/936280 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [14:39:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42350/console" [puppet] - 10https://gerrit.wikimedia.org/r/936285 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [14:40:50] (03CR) 10Jbond: [C: 03+2] puppetd::site: Add a way to open the firewall [puppet] - 10https://gerrit.wikimedia.org/r/936284 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [14:40:52] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: add allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/936285 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [14:42:30] sukhe Thanks for the heads-up...checking [14:43:03] FWiW, 2018 is expected but not 2012 [14:43:43] thanks for checking (on on-call so asking :) [14:44:27] RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:24] I'm working on fixing our cookbooks so they don't remove downtimes for non-prod hosts in https://phabricator.wikimedia.org/T340793 . Sorry for the noise! 
[14:45:44] np at all and thanks for checking [14:46:51] (03Abandoned) 10Bking: wdqs.data-transfer: reformat using black [cookbooks] - 10https://gerrit.wikimedia.org/r/934595 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [14:47:56] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:48:21] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:48:49] Guess we need an auto-restart for the auto-restarter ;( [14:49:40] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [14:49:45] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s) [14:50:29] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [14:50:34] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s) [14:51:06] (03PS1) 10Ssingh: P:monitoring: improve check_service_restart.py [puppet] - 10https://gerrit.wikimedia.org/r/936286 [14:52:13] (03PS4) 10Jbond: wmnet: add SRV records for puppet compileres and CA [dns] - 10https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) [14:52:46] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42351/console" [puppet] - 10https://gerrit.wikimedia.org/r/936286 (owner: 10Ssingh) [14:53:17] (03CR) 10Jbond: [C: 03+2] wmnet: add SRV records for puppet compileres and CA [dns] - 10https://gerrit.wikimedia.org/r/935396 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [14:53:32] (03PS1) 10Ladsgroup: sre.mysql.clone: Only encrypt data transfers between DCs [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 [14:54:24] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:monitoring: improve 
check_service_restart.py [puppet] - 10https://gerrit.wikimedia.org/r/936286 (owner: 10Ssingh) [14:55:52] (03CR) 10Arturo Borrero Gonzalez: wikimediacloud.org: add netbox includes (038 comments) [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [14:56:14] (03PS9) 10Arturo Borrero Gonzalez: wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) [14:57:22] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [14:57:57] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1001.eqiad.wmnet with OS bullseye [14:58:12] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 49s) [14:58:26] (03CR) 10Filippo Giunchedi: "+David for heads up, LGTM too" [alerts] - 10https://gerrit.wikimedia.org/r/936070 (owner: 10Alexandros Kosiaris) [14:58:53] (03PS3) 10Jbond: puppet::agent: configure srv_domain based on site [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) [15:00:42] (03CR) 10CI reject: [V: 04-1] puppet::agent: configure srv_domain based on site [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [15:02:12] (03PS1) 10Ssingh: P:ntp: update the stale conf file check intervals [puppet] - 10https://gerrit.wikimedia.org/r/936291 [15:03:13] (03CR) 10Filippo Giunchedi: data-engineering: add alerts flink enrichment apps (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [15:04:00] (03PS2) 10Jbond: puppet::agent: add support for srv records [puppet] - 10https://gerrit.wikimedia.org/r/935407 (https://phabricator.wikimedia.org/T341053) [15:04:02] (03PS4) 10Jbond: puppet::agent: configure srv_domain based on site [puppet] - 10https://gerrit.wikimedia.org/r/935409 
(https://phabricator.wikimedia.org/T341053) [15:04:50] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [15:05:02] (03CR) 10Ssingh: [C: 03+2] P:ntp: update the stale conf file check intervals [puppet] - 10https://gerrit.wikimedia.org/r/936291 (owner: 10Ssingh) [15:05:40] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 50s) [15:05:48] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:05:56] RECOVERY - Query Service HTTP Port on wdqs2021 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:06:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42353/console" [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [15:06:30] RECOVERY - WDQS SPARQL on wdqs2021 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:07:50] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:08:45] (03PS5) 10Jbond: puppet::agent: configure srv_domain based on site [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) [15:10:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42354/console" [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [15:15:44] RECOVERY - Query Service HTTP Port on wdqs2018 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.022 
second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:17:42] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Configure SRV records for new puppet infrastructure - https://phabricator.wikimedia.org/T341053 (10jbond) Still intend to run tests on one of the pops, however this is the SRV queries made during a puppet run so things are look... [15:18:43] (03CR) 10Jbond: [C: 03+2] puppet::agent: add support for srv records [puppet] - 10https://gerrit.wikimedia.org/r/935407 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [15:18:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet::agent: configure srv_domain based on site [puppet] - 10https://gerrit.wikimedia.org/r/935409 (https://phabricator.wikimedia.org/T341053) (owner: 10Jbond) [15:22:54] (03PS3) 10Giuseppe Lavagetto: mw-debug-repl: improve UX [puppet] - 10https://gerrit.wikimedia.org/r/936280 (https://phabricator.wikimedia.org/T341197) [15:23:24] (03CR) 10Ottomata: data-engineering: add alerts flink enrichment apps (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [15:24:45] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [15:26:00] (03PS1) 10Btullis: Deploy a new datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/936295 (https://phabricator.wikimedia.org/T329514) [15:26:21] (03PS5) 10Kamila Součková: [WIP] add Benthos cache invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 [15:27:03] (03CR) 10CI reject: [V: 04-1] [WIP] add Benthos cache invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (owner: 10Kamila Součková) [15:27:07] (03PS4) 10Clément Goubert: mw-debug-repl: improve UX [puppet] - 10https://gerrit.wikimedia.org/r/936280 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [15:27:23] (03CR) 10Btullis: [C: 03+2] Deploy a new datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/936295 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:28:05] (03CR) 10Clément Goubert: [C: 03+1] mw-debug-repl: improve UX (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936280 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [15:28:11] (03Merged) 10jenkins-bot: Deploy a new datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/936295 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:29:56] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jbond) Seems that puppet5 is fine sending things to puppet7. however sending from puppet7 to puppet5 works for the facts and the catalog but fails for... [15:30:21] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:31:47] (03PS1) 10Ssingh: Release pdns-recursor 4.8.4-1+wmf11u1. 
[debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936297 [15:32:14] (03CR) 10Ssingh: "CI patch is at I629e21278e48fee48a2f707aaee67bf2dc81c0f5" [debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936297 (owner: 10Ssingh) [15:32:45] (03PS5) 10Dzahn: miscweb: add statictendril release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) [15:33:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: add netbox includes [dns] - 10https://gerrit.wikimedia.org/r/936246 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [15:33:30] (03CR) 10Dzahn: miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [15:33:42] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [15:33:57] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:34:28] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10jbond) 05Open→03Resolved a:03jbond We now have puppetserver, db and puppetboard running on both codfw and eqiad [15:35:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:40:36] (03PS1) 10Btullis: Bump the version number of the datahub-frontend chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/936301 (https://phabricator.wikimedia.org/T329514) 
[15:42:53] (03CR) 10Btullis: [C: 03+2] Bump the version number of the datahub-frontend chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/936301 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:43:32] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb1001.eqiad.wmnet with OS bullseye [15:43:44] (03Merged) 10jenkins-bot: Bump the version number of the datahub-frontend chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/936301 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:43:51] (03CR) 10Ssingh: [C: 03+1] "@Moritz: Would appreciate your review and a +1 here before merging." [puppet] - 10https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) (owner: 10Hashar) [15:45:53] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:46:31] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Nikki) >>! In T325607#8898528, @SCherukuwada wrote: > @Soda Yeah Navboxes would indeed have helped. > > Tell me if this makes s... 
[15:46:49] !log bking@cumin1001 conftool action : set/pooled=yes; selector: service=(wdqs|wdqs-ssl|wdqs-heavy-queries),name=wdqs2020.codfw.wmnet [15:47:37] !log aborrero@cumin1001 START - Cookbook sre.hosts.provision for host cloudlb1001.mgmt.eqiad.wmnet with reboot policy FORCED [15:49:08] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:50:48] !log bking@cumin1001 conftool action : set/weight=10; selector: name=wdqs2020.codfw.wmnet [15:51:16] !log bking@cumin1001 conftool action : set/pooled=yes; selector: name=wdqs2020.codfw.wmnet [15:53:35] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb1001.mgmt.eqiad.wmnet with reboot policy FORCED [15:59:02] 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (10aborrero) a:05aborrero→03Papaul hey @papaul or @Jclark-ctr I'm requesting help with this host. We are trying to reimage after re... [16:00:24] (03PS2) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) [16:00:59] !log bking@cumin1001 conftool action : set/pooled=no; selector: name=wdqs2020.codfw.wmnet [16:01:13] (03PS1) 10Elukey: profile::kafka: update prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) [16:01:32] (03PS3) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) [16:02:05] (03CR) 10Elukey: "Followed https://wikitech.wikimedia.org/wiki/Prometheus#JMX to check the Mbean, it shows the correct value (a float from 0 to 1)." 
[puppet] - 10https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [16:02:25] (03PS6) 10Arturo Borrero Gonzalez: cloudlb: eqiad: bootstrap hiera data [puppet] - 10https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200) [16:09:00] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 161 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [16:09:22] oh uh [16:10:11] sukhe: it's big but going down [16:10:56] lots of bounces [16:11:21] RhinosF1: yeah seems to be getting better, let's see [16:11:25] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10TomerLerner) It seems "Wikiwand/0.1 (https://www.wikiwand.com; admin@wikiwand.com)" is blocked on some (if not all) end points... 
[16:11:26] * sukhe sprinkles some magic dust [16:14:34] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [16:15:47] woho, the magic dust work [16:15:48] ed [16:16:12] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2017 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:19:22] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:17] !log Restarting CI Jenkins due to a confusion in the next build number leading to intermittent 404 when browsing console links | T341348 [16:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:21] T341348: In-progress Jenkins logs sometimes unavailable (HTTP ERROR 404 Not Found) - https://phabricator.wikimedia.org/T341348 [16:22:28] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:30] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:23:34] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2022 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:23:36] PROBLEM - Query Service HTTP Port on wdqs2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:23:50] ^ bking, sorry [16:23:57] but for awareness [16:23:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2022:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:24:03] hth in any way if I can [16:24:10] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:24:19] (SystemdUnitFailed) firing: (3) wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:24:20] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:24:36] PROBLEM - Check systemd state on wdqs2017 is CRITICAL: CRITICAL - degraded: The following units failed: load-dcatap-weekly.service,prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:46] PROBLEM - WDQS SPARQL on wdqs2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:04] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2017 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:08] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2022 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:19] (03PS1) 10Volans: quotereviewer: adapt to new Dell PDF format [software] - 10https://gerrit.wikimedia.org/r/936305 (https://phabricator.wikimedia.org/T341345) [16:28:24] (SystemdUnitFailed) resolved: (3) wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:28:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs2022:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:32:12] (03PS1) 10Jbond: pcc: dont recurse directories [puppet] - 10https://gerrit.wikimedia.org/r/936306 [16:36:44] (03CR) 10Jbond: [C: 03+2] pcc: dont recurse directories [puppet] - 10https://gerrit.wikimedia.org/r/936306 (owner: 10Jbond) [16:38:59] !log bking@cumin1001 conftool action : set/pooled=yes; selector: name=wdqs2020.codfw.wmnet [16:39:03] (03CR) 10Volans: [C: 03+2] "tested with all the PDFs in the task, self-merging to let Rob use it" [software] - 
10https://gerrit.wikimedia.org/r/936305 (https://phabricator.wikimedia.org/T341345) (owner: 10Volans) [16:39:38] (03Merged) 10jenkins-bot: quotereviewer: adapt to new Dell PDF format [software] - 10https://gerrit.wikimedia.org/r/936305 (https://phabricator.wikimedia.org/T341345) (owner: 10Volans) [16:44:09] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1001.eqiad.wmnet with OS bullseye [16:44:22] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudlb1001.eqiad.wmnet wi... [16:44:59] (03CR) 10Volans: sre.mysql.clone: Only encrypt data transfers between DCs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup) [16:48:21] (03PS1) 10Jbond: puppet_compiler: ensure we link the yaml dir before initiating other dirs [puppet] - 10https://gerrit.wikimedia.org/r/936309 [16:49:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42355/console" [puppet] - 10https://gerrit.wikimedia.org/r/936309 (owner: 10Jbond) [16:51:05] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet_compiler: ensure we link the yaml dir before initiating other dirs [puppet] - 10https://gerrit.wikimedia.org/r/936309 (owner: 10Jbond) [16:59:36] (03PS1) 10Jbond: puppet_compiler: drop recurse. [puppet] - 10https://gerrit.wikimedia.org/r/936312 [17:01:01] (03CR) 10Jbond: [C: 03+2] puppet_compiler: drop recurse. 
[puppet] - 10https://gerrit.wikimedia.org/r/936312 (owner: 10Jbond) [17:05:47] (03PS1) 10RobH: updating R450 skus [software] - 10https://gerrit.wikimedia.org/r/936313 [17:07:20] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2017 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:09:36] 10SRE, 10TimedMediaHandler, 10serviceops: Upgrade Wikimedia production's ffmpeg to 4.4 or later so we can use the fpsmax flag - https://phabricator.wikimedia.org/T318419 (10TheDJ) BTW. it seems that stable is now at 5.1.3-1. Our current versions are: - MW servers: 4.1.11 - Thumbor: 3.2.18 - Docker: 4... [17:12:00] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:14:16] (03PS1) 10Btullis: Fix the datahub frontend authentication [deployment-charts] - 10https://gerrit.wikimedia.org/r/936314 (https://phabricator.wikimedia.org/T329514) [17:22:14] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:44] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:44] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for cec - https://phabricator.wikimedia.org/T340890 (10Dzahn) 05In progress→03Resolved a:05CCoxwell-WMF→03Arnoldokoth Optimistically resolving it. If you run into any problems with this, just comment / reopen this ticket. 
Thank you [17:25:51] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Dzahn) [17:26:33] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for gengh - https://phabricator.wikimedia.org/T340614 (10Dzahn) 05In progress→03Resolved Optimistically resolving it. If you run into any problems with this, just comment / reopen the ticket. Thank you! [17:26:37] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Dzahn) [17:28:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10Dzahn) 05Stalled→03Declined Alright! thanks. Well, then let's pick option b). I close this as Declined but it really means "just for now" and once you s... [17:30:20] (03PS1) 10Cory Massaro: Add AppArmor configuration for the deployed function-evaluator service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/936316 [17:30:59] (03CR) 10CI reject: [V: 04-1] Add AppArmor configuration for the deployed function-evaluator service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/936316 (owner: 10Cory Massaro) [17:34:10] (03CR) 10Cory Massaro: "Thank you for taking a look!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/936316 (owner: 10Cory Massaro) [17:35:17] (03PS1) 10Jbond: puppet_compiler: load class { '::sslcert::dhparam': } [puppet] - 10https://gerrit.wikimedia.org/r/936318 [17:37:45] (03CR) 10Jbond: [C: 03+2] puppet_compiler: load class { '::sslcert::dhparam': } [puppet] - 10https://gerrit.wikimedia.org/r/936318 (owner: 10Jbond) [17:38:19] (03PS1) 10Andrew Bogott: radosgw: config tweaks [puppet] - 10https://gerrit.wikimedia.org/r/936319 (https://phabricator.wikimedia.org/T338937) [17:40:26] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:40:39] (03PS2) 10Andrew Bogott: radosgw: turn off implicit tenants. [puppet] - 10https://gerrit.wikimedia.org/r/936319 (https://phabricator.wikimedia.org/T338937) [17:42:01] (03CR) 10Andrew Bogott: [C: 03+2] radosgw: turn off implicit tenants. [puppet] - 10https://gerrit.wikimedia.org/r/936319 (https://phabricator.wikimedia.org/T338937) (owner: 10Andrew Bogott) [17:45:02] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2018 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:48:44] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) [17:48:54] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [17:51:14] (03PS1) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 [17:52:27] (03PS1) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on group2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) [17:53:28] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 
10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Soda) >>! In T325607#8997709, @Nikki wrote: >>>! In T325607#8898528, @SCherukuwada wrote: >> @Soda Yeah Navboxes would indeed ha... [17:53:50] (03CR) 10CI reject: [V: 04-1] puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (owner: 10Jbond) [17:53:54] (03CR) 10Subramanya Sastry: [C: 03+1] Disable wgParserEnableLegacyMediaDOM on group2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [17:54:31] (03CR) 10Dzahn: Add monitoring for mirrors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [17:55:01] (03CR) 10Btullis: [C: 03+2] Fix the datahub frontend authentication [deployment-charts] - 10https://gerrit.wikimedia.org/r/936314 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [17:55:51] (03Merged) 10jenkins-bot: Fix the datahub frontend authentication [deployment-charts] - 10https://gerrit.wikimedia.org/r/936314 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [17:55:57] (03PS2) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 [17:56:35] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [17:57:12] (03PS3) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 [17:57:14] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb1001.eqiad.wmnet with OS bullseye [17:57:25] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host 
cloudlb1001.eqiad.wmnet with O... [17:58:27] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [17:59:13] (03CR) 10CI reject: [V: 04-1] puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (owner: 10Jbond) [18:00:39] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (10BCornwall) I poked around the puppet facts for a few services and didn't really find anything descriptive. Wh... [18:00:49] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (10BCornwall) 05Open→03Stalled [18:02:23] (03PS1) 10Btullis: Update datahub jaas volume name [deployment-charts] - 10https://gerrit.wikimedia.org/r/936324 (https://phabricator.wikimedia.org/T329514) [18:02:34] (03CR) 10BCornwall: [C: 03+2] pybal: Fix hostnames not being sent on alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: 10BCornwall) [18:03:30] (03CR) 10Btullis: [C: 03+2] Update datahub jaas volume name [deployment-charts] - 10https://gerrit.wikimedia.org/r/936324 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [18:04:15] (03Merged) 10jenkins-bot: Update datahub jaas volume name [deployment-charts] - 10https://gerrit.wikimedia.org/r/936324 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [18:05:31] (03PS4) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 [18:06:17] (03CR) 10Jbond: "this is still work in progress but ultimately will like you to review so thought id ping early 😊" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (owner: 10Jbond) [18:07:02] (03CR) 10Jbond: 
"there are still some issues with paths, possibly related to the secret function. but surprisingly this is already looking good https://pu" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (owner: 10Jbond) [18:07:31] (03PS3) 10Dzahn: mirrors: Add monitoring for mirrors [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [18:07:40] (03CR) 10CI reject: [V: 04-1] puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (owner: 10Jbond) [18:07:45] (03CR) 10CI reject: [V: 04-1] mirrors: Add monitoring for mirrors [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [18:08:27] (03PS4) 10Dzahn: mirrors: Add monitoring for mirrors [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [18:08:43] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [18:11:37] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [18:12:39] (03CR) 10Dzahn: contint: replace Apache 2.2 access control syntax for Jenkins proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [18:14:13] (03CR) 10Dzahn: "I would also be ok deploying this on my own if Antoine agrees. I was just going off the original comment where Eoghan was pinged to do it " [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [18:15:52] (03CR) 10Dzahn: [C: 03+1] "and thanks Jaime for the review:) You found out on the unrelated change how I was doing it wrong and RequireAll can't be repeated. 
But fro" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [18:17:21] (03PS5) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 [18:18:20] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:19:35] (03CR) 10CI reject: [V: 04-1] puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (owner: 10Jbond) [18:19:38] (03CR) 10Jbond: "and with the latest ps we are noop. https://puppet-compiler.wmflabs.org/output/936273/1/" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (owner: 10Jbond) [18:19:59] (03CR) 10Dzahn: "now critical means serious but NOT paging - when looking at modules/alertmanager/templates/alertmanager.yml.erb this should mean that "sre" [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [18:21:40] (03CR) 10Dzahn: [C: 03+2] extdist: Remove pre-bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm) [18:22:07] (03PS1) 10Jbond: DO NOT MEREG: Ps to exercise PCC [puppet] - 10https://gerrit.wikimedia.org/r/936325 [18:22:25] (03CR) 10Jbond: [V: 04-1 C: 04-2] "do not merge" [puppet] - 10https://gerrit.wikimedia.org/r/936325 (owner: 10Jbond) [18:22:27] (03CR) 10Dzahn: [C: 03+2] "rebased - it was basically already done but some minor formatting left, merging" [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm) [18:22:30] (03CR) 10CI reject: [V: 04-1] DO NOT MEREG: Ps to exercise PCC [puppet] - 10https://gerrit.wikimedia.org/r/936325 (owner: 10Jbond) [18:22:56] (03CR) 10Dzahn: [C: 03+2] "also see compiler output shows noop and only a cloud instance" [puppet] - 
10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm) [18:24:38] (03CR) 10Dzahn: "@Legoktm +1 from you and I will merge that right now" [puppet] - 10https://gerrit.wikimedia.org/r/828057 (owner: 10Chad) [18:27:06] (03CR) 10Dzahn: [C: 03+1] "well, we agreed on just trying it and we already have a +1 from Filippo as well, what more can we ask for:) let me deploy it and keep an e" [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [18:27:49] (03CR) 10Dzahn: [C: 03+2] "severity is just "task" so worst that happens is some tickets to close again" [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [18:28:38] (03PS2) 10Jbond: DO NOT MEREG: Ps to exercise PCC [puppet] - 10https://gerrit.wikimedia.org/r/936325 [18:29:21] (03Merged) 10jenkins-bot: sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [18:35:32] (03CR) 10Jbond: "test run https://puppet-compiler.wmflabs.org/output/936325/1/" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (owner: 10Jbond) [18:38:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:41:19] (03CR) 10Dzahn: [C: 03+1] "So I think it's like "disable puppet, merge a bunch of changes in no particular order, enable puppet" actually. If puppet is disabled anyw" [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [18:42:11] (03CR) 10DCausse: [C: 03+1] "lgtm!" 
[alerts] - 10https://gerrit.wikimedia.org/r/936070 (owner: 10Alexandros Kosiaris) [18:48:43] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) [18:49:18] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) @SLyngshede-WMF this may be a more interesting one for you, let me know if you need more info [18:49:59] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) [18:58:17] (03PS6) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373) [19:00:27] (03CR) 10CI reject: [V: 04-1] puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373) (owner: 10Jbond) [19:00:44] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [19:01:32] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) the puppetdb and puppetmaster ones are valid so i have left them, althugh we could probably change to just using `$facts['puppet_confi... 
[19:01:42] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) p:05Triage→03Medium [19:03:40] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): remove puppet::expose_agent_certs from puppetdb classes - https://phabricator.wikimedia.org/T341374 (10jbond) [19:05:01] (03CR) 10BCornwall: [C: 04-1] sre.cdn: move common functions to base class (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [19:09:17] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) [19:12:10] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2017 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:12:15] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:21:46] 10SRE, 10Domains: Mark Monitor administration panel (redirects for wikimedia.pl) - https://phabricator.wikimedia.org/T333827 (10Dzahn) per Slack: Wikimedia Poland should be unblocked and is working on moving the page [19:32:59] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:33:04] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [19:33:59] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:41:24] (03PS1) 10Dzahn: planet: quoting, style guide fixes [puppet] - 10https://gerrit.wikimedia.org/r/936329 [19:42:48] (03PS1) 10Dzahn: planet: remove buster support [puppet] - 10https://gerrit.wikimedia.org/r/936331 [19:44:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:49:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:56:04] 10SRE, 10TimedMediaHandler, 10serviceops: Upgrade Wikimedia production's ffmpeg to 4.4 or later so we can use the fpsmax flag - https://phabricator.wikimedia.org/T318419 (10brion) Note I've worked around this in the related cleanup on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+... [19:58:13] (03PS1) 10Andrew Bogott: Revert "radosgw: turn off implicit tenants." [puppet] - 10https://gerrit.wikimedia.org/r/936116 [19:58:45] (03CR) 10CI reject: [V: 04-1] Revert "radosgw: turn off implicit tenants." [puppet] - 10https://gerrit.wikimedia.org/r/936116 (owner: 10Andrew Bogott) [19:59:07] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) [20:00:02] (03PS2) 10Andrew Bogott: Revert "radosgw: turn off implicit tenants." [puppet] - 10https://gerrit.wikimedia.org/r/936116 [20:01:45] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jbond) in the last migration we used a lua filter to strip out new keys we may be able to do the same as such here is an exampe of a working report P49531 [20:04:04] (03CR) 10Andrew Bogott: [C: 03+2] Revert "radosgw: turn off implicit tenants." 
[puppet] - 10https://gerrit.wikimedia.org/r/936116 (owner: 10Andrew Bogott) [20:12:29] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10Ferien) >>! In T341097#8995420, @jhsoby wrote: > The spammers have now moved on from promoting that one IRC network to posting links and ASCII art depicting... [20:13:13] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (10KFrancis) Hi all, Let me do some research and get back to you! Thanks!!! [20:13:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:20:39] (03PS1) 10Kimberly Sarabia: Turn off A/B Test in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) [20:21:28] (03PS1) 10Jbond: puppetserver: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/936334 [20:22:28] (03CR) 10Jbond: [C: 03+2] puppetserver: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/936334 (owner: 10Jbond) [20:26:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:34:54] (03PS1) 10Dwisehaupt: Remove frpig1001 from dns, decommissioning 
[dns] - 10https://gerrit.wikimedia.org/r/936335 (https://phabricator.wikimedia.org/T340128) [20:36:03] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jbond) hmm i have now also seen an issue inserting a catalog ` 2023-07-07T20:32:19.700Z INFO [p.p.command] [16825972-1688761939638] [52 ms] 'replace f... [20:36:48] (03CR) 10Jdlrobson: [C: 03+1] Turn off A/B Test in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia) [20:44:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1156.eqiad.wmnet with OS bullseye [20:44:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye [20:48:43] (03CR) 10Jgreen: [C: 03+2] Remove frpig1001 from dns, decommissioning [dns] - 10https://gerrit.wikimedia.org/r/936335 (https://phabricator.wikimedia.org/T340128) (owner: 10Dwisehaupt) [20:50:05] !log dwisehaupt@cumin1001 START - Cookbook sre.dns.netbox [20:52:36] (GitLabCIPipelineErrors) firing: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors [20:52:44] !log dwisehaupt@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1001" [20:53:27] !log dwisehaupt@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1001" [20:53:27] !log dwisehaupt@cumin1001 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [20:54:59] (PuppetDisabled) firing: Puppet disabled on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-test&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [20:55:12] PROBLEM - puppet last run on wdqs1010 is CRITICAL: CRITICAL: Puppet has been disabled for 604882 seconds, message: testing prom exporters - bking, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:57:36] (GitLabCIPipelineErrors) resolved: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors [20:59:59] (PuppetDisabled) resolved: Puppet disabled on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-test&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [21:00:48] RECOVERY - puppet last run on wdqs1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:13:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:16:37] (03CR) 10BCornwall: [V: 03+1 C: 03+1] "This seems to make sense. 
If I'm understanding correctly, this is using the internal PKI so it will only be trusted by internal servers wh" [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [21:18:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:19:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:22] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:23:54] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [21:24:26] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:24:52] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 57s) [21:25:18] RECOVERY - Query Service HTTP Port on wdqs2017 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:25:34] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2017 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:26:00] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2017 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:26:14] RECOVERY - WDQS SPARQL on wdqs2017 is OK: HTTP OK: 
HTTP/1.1 200 OK - 690 bytes in 0.239 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:30:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Dwisehaupt) @Jclark-ctr @papaul I just reverified that the host has connectivity, but is still in the wrong VLAN. ` Jul 7 21:27:44 frpm1002 dhcpd[4084378]: DHCPDISCOVER from... [21:38:40] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2017 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:39:16] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2017 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:59:43] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [22:04:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1156.eqiad.wmnet with OS bullseye [22:04:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye executed with errors: - an-wo... 
[22:06:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:08:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:13:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:13:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:17:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:18:20] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:18:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:21:15] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [22:32:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:37:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:41:19] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply 
[22:53:34] (HelmReleaseBadStatus) firing: Helm release opentelemetry-collector/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:55:53] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [22:55:55] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [23:23:34] (HelmReleaseBadStatus) resolved: Helm release opentelemetry-collector/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:42:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded