[00:31:05] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930829 (owner: 10TrainBranchBot) [00:39:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930831 [00:39:35] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930831 (owner: 10TrainBranchBot) [00:47:10] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:59:42] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930831 (owner: 10TrainBranchBot) [01:15:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [02:10:04] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:02] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following 
units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:41:09] (03PS5) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) [02:42:10] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [02:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [03:02:01] (03PS6) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) [03:03:00] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [03:11:05] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:16:49] (03PS7) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) [03:17:40] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [03:21:37] (03PS1) 10Andrew Bogott: cinder-backup: decrease block size and # of concurrent operations [puppet] - 10https://gerrit.wikimedia.org/r/930947 (https://phabricator.wikimedia.org/T339830) [03:23:10] (03CR) 10Andrew Bogott: [C: 03+2] cinder-backup: decrease block size and # of concurrent operations [puppet] - 10https://gerrit.wikimedia.org/r/930947 (https://phabricator.wikimedia.org/T339830) (owner: 10Andrew Bogott) [04:04:22] (03PS8) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) [04:05:19] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [04:16:01] (03PS2) 10KartikMistry: Update MinT to 2023-06-16-042302-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/930743 (https://phabricator.wikimedia.org/T339271) [04:21:09] (03PS9) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country [puppet] - 10https://gerrit.wikimedia.org/r/930941 
(https://phabricator.wikimedia.org/T337318)
[04:22:17] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: Jameel Kaisar)
[04:25:37] * kart_ deploying MinT
[04:27:16] (CR) KartikMistry: [C: +2] Update MinT to 2023-06-16-042302-production [deployment-charts] - https://gerrit.wikimedia.org/r/930743 (https://phabricator.wikimedia.org/T339271) (owner: KartikMistry)
[04:27:57] (Merged) jenkins-bot: Update MinT to 2023-06-16-042302-production [deployment-charts] - https://gerrit.wikimedia.org/r/930743 (https://phabricator.wikimedia.org/T339271) (owner: KartikMistry)
[04:29:07] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[04:44:53] Looks like staging is stuck or taking too long to deploy. I'll hold off on the production deployment.
[04:49:20] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[04:52:02] Oh :)
[04:54:12] Looks like staging is down for MinT.
[04:59:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:01:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:04:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:05:33] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:05:48] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:09:06] Reattempting staging deployment.. 
[05:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:14:10] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [05:14:30] (03PS10) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) [05:15:30] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [05:16:34] (HelmReleaseBadStatus) resolved: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:22:04] I see no major changes in MinT, but - image: docker-registry.discovery.wmnet/envoy:1.18.3-2 --> + image: docker-registry.discovery.wmnet/envoy:1.18.3-2-s2 [05:34:21] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [05:36:40] OK. It failed again. 
[05:40:22] (03PS11) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) [05:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:41:07] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [05:44:26] (03PS12) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) [05:45:01] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [05:46:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:48:42] (03CR) 10Ayounsi: [C: 03+2] users: Replace vgutierrez RSA key with an ed25519 one [homer/public] - 10https://gerrit.wikimedia.org/r/929998 (https://phabricator.wikimedia.org/T336769) (owner: 10Vgutierrez) [05:48:50] (03CR) 10Ayounsi: [C: 03+2] cdanis: new ed25519 ssh key [homer/public] - 10https://gerrit.wikimedia.org/r/930860 (https://phabricator.wikimedia.org/T336769) (owner: 10CDanis) [05:49:21] (03Merged) 10jenkins-bot: users: Replace vgutierrez RSA key with an ed25519 one [homer/public] - 10https://gerrit.wikimedia.org/r/929998 (https://phabricator.wikimedia.org/T336769) 
(owner: 10Vgutierrez) [05:49:24] (03Merged) 10jenkins-bot: cdanis: new ed25519 ssh key [homer/public] - 10https://gerrit.wikimedia.org/r/930860 (https://phabricator.wikimedia.org/T336769) (owner: 10CDanis) [05:57:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) [05:57:48] (03CR) 10Jameel Kaisar: "Country division is based on the following criteria:" [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [06:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [06:16:30] (03CR) 10Jameel Kaisar: Probenet: Configure NetworkProbeLimit to get adequate data for each country (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [06:29:20] (03PS1) 10Urbanecm: Add throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931053 [06:29:47] (03CR) 10Urbanecm: [C: 03+2] Add throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931053 (owner: 10Urbanecm) [06:30:28] (03PS2) 10Urbanecm: Add throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931053 [06:30:31] (03CR) 10CI reject: [V: 04-1] Add throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931053 (owner: 10Urbanecm) [06:30:39] (03CR) 10Urbanecm: [C: 03+2] Add throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931053 (owner: 10Urbanecm) [06:31:37] (03Merged) 10jenkins-bot: Add throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931053 (owner: 10Urbanecm) [06:32:23] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:931053|Add throttle rule]] 
[06:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:39:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:931053|Add throttle rule]] (duration: 07m 10s) [06:39:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:41:38] (03PS2) 10Alexandros Kosiaris: service::catalog: Depuplicate search service IPs [puppet] - 10https://gerrit.wikimedia.org/r/930175 [06:43:51] (ProbeDown) firing: (2) Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#idm2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:46:21] (03PS1) 10Alexandros Kosiaris: admin: Add new ed25519 key for akosiaris [puppet] - 10https://gerrit.wikimedia.org/r/931055 (https://phabricator.wikimedia.org/T336769) [06:47:06] (03CR) 10CI reject: [V: 04-1] admin: Add new ed25519 key for akosiaris [puppet] - 10https://gerrit.wikimedia.org/r/931055 (https://phabricator.wikimedia.org/T336769) (owner: 10Alexandros Kosiaris) [06:48:00] (03PS1) 10Alexandros Kosiaris: users: Add ed25519 key for akosiaris [homer/public] - 10https://gerrit.wikimedia.org/r/931056 
(https://phabricator.wikimedia.org/T336769) [06:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [06:57:07] (03PS2) 10KartikMistry: Use Parsoid for all Wikis for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) [06:58:39] (03PS1) 10Elukey: Add a new ssh key for elukey [homer/public] - 10https://gerrit.wikimedia.org/r/931057 (https://phabricator.wikimedia.org/T336769) [07:00:06] Amir1, Urbanecm, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T0700). [07:00:06] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:13] \0 [07:01:06] xSavitar: I'm going to deploy, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/930744 - Heads up as you mentioned in the Gerrit. Followup patch can be reviewed, but since there is no train this week, no need to hurry. 
[07:01:14] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) (owner: 10KartikMistry) [07:03:02] (03Merged) 10jenkins-bot: Use Parsoid for all Wikis for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) (owner: 10KartikMistry) [07:03:20] !log kartik@deploy1002 Started scap: Backport for [[gerrit:930744|Use Parsoid for all Wikis for Content Translation (T339322)]] [07:03:24] T339322: Use Parsoid in all Wikis for Content Translation - https://phabricator.wikimedia.org/T339322 [07:04:41] !log kartik@deploy1002 kartik: Backport for [[gerrit:930744|Use Parsoid for all Wikis for Content Translation (T339322)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [07:07:39] (03PS1) 10Muehlenhoff: Remove access for demon [puppet] - 10https://gerrit.wikimedia.org/r/931059 [07:10:20] Testing looks good; going ahead.. [07:11:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1124.eqiad.wmnet with OS bookworm [07:12:47] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for demon [puppet] - 10https://gerrit.wikimedia.org/r/931059 (owner: 10Muehlenhoff) [07:14:23] (03PS1) 10Muehlenhoff: Remove LDAP access for prtksxna [puppet] - 10https://gerrit.wikimedia.org/r/931062 [07:14:52] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:930744|Use Parsoid for all Wikis for Content Translation (T339322)]] (duration: 11m 31s) [07:14:56] T339322: Use Parsoid in all Wikis for Content Translation - https://phabricator.wikimedia.org/T339322 [07:18:12] I'm done with backport. 
[07:19:27] (03PS1) 10Marostegui: control-mariadb-client-10.6-bookworm: Add to repo [software] - 10https://gerrit.wikimedia.org/r/931063 (https://phabricator.wikimedia.org/T339326) [07:21:20] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for prtksxna [puppet] - 10https://gerrit.wikimedia.org/r/931062 (owner: 10Muehlenhoff) [07:28:10] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.6-bookworm: Add to repo [software] - 10https://gerrit.wikimedia.org/r/931063 (https://phabricator.wikimedia.org/T339326) (owner: 10Marostegui) [07:28:43] (03Merged) 10jenkins-bot: control-mariadb-client-10.6-bookworm: Add to repo [software] - 10https://gerrit.wikimedia.org/r/931063 (https://phabricator.wikimedia.org/T339326) (owner: 10Marostegui) [07:38:46] !log uploaded wmfmariadbpy 0.10+deb12u1 [07:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:27] (03PS5) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [07:39:50] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1124.eqiad.wmnet with OS bookworm [07:40:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1124.eqiad.wmnet with OS bullseye [07:41:50] (03CR) 10CI reject: [V: 04-1] C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [07:43:55] (03PS6) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [07:46:26] (03CR) 10CI reject: [V: 04-1] C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [07:50:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1124.eqiad.wmnet with reason: host reimage [07:52:15] (03PS7) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [07:53:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1124.eqiad.wmnet with reason: host reimage [07:54:25] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Chad out of all services on: 1259 hosts [07:55:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Chad out of all services on: 1259 hosts [07:56:07] (ProbeDown) firing: Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:58:59] (PuppetDisabled) resolved: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [08:00:58] (03PS3) 10Clément Goubert: trafficserver: Send testwiki traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/930547 (https://phabricator.wikimedia.org/T337489) [08:01:07] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:02:46] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for 
Puppet 7 - https://phabricator.wikimedia.org/T330490 (Aklapper)
[08:02:49] SRE, Infrastructure-Foundations, Puppet (Puppet 7.0): puppetdb7 cross polonation - https://phabricator.wikimedia.org/T338811 (Aklapper)
[08:03:38] Puppet, Infrastructure-Foundations, Project-Admins, PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (Aklapper) @joanna_borun I boldly edited the description at https://phabricator.wikimedia.org/project/manage/78/ - does that make sense?
[08:03:44] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1124.eqiad.wmnet with OS bullseye
[08:04:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1124.eqiad.wmnet with OS bookworm
[08:06:07] (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:07:34] esams looks like it's suffering
[08:12:14] kart_: Regarding your MinT deployment and envoy, I updated the global envoy image version, but the only change is an added script in the image.
[08:14:43] claime: Logs aren't very helpful after the deployment failure, so I'm not sure what's wrong. akosiaris, can you look when you're around?
[08:15:21] (03PS1) 10Ladsgroup: moveToExternal: First decompress gziped entries before iconv [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930925 (https://phabricator.wikimedia.org/T128150) [08:15:50] (03CR) 10Ladsgroup: [C: 03+2] moveToExternal: First decompress gziped entries before iconv [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930925 (https://phabricator.wikimedia.org/T128150) (owner: 10Ladsgroup) [08:17:57] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41768/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:18:16] kart_: I can take a quick look [08:19:00] claime: Thanks. Should I paste logs somewhere? [08:19:29] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Chad out of all services on: 776 hosts [08:19:43] kart_: Have you checked out https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting ? [08:19:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Chad out of all services on: 776 hosts [08:20:20] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Chad out of all services on: 19 hosts [08:20:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Chad out of all services on: 19 hosts [08:20:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:21:28] kart_: your app pod fails connecting to https://peopleweb.discovery.wmnet/~santhosh/nllb/nllb200-600M/config.json apparently [08:21:42] ah. 
[08:22:11] Found through `kubectl logs machinetranslation-staging-774779654c-qjtjx machinetranslation-staging`
[08:25:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:26:47] kart_: you're missing egress for people
[08:27:35] claime: not sure if we changed anything recently.
[08:27:47] Not sure it's great to get your config from people though
[08:28:10] why is a production service trying to reach a people home dir of a specific person in the first place?
[08:28:16] taavi: Agreed.
[08:28:25] is today a no-deploy day (Juneteenth) or not? I see all the usual windows in the deployment calendar…
[08:28:43] claime: That's going to be deprecated.
[08:29:34] claime: https://phabricator.wikimedia.org/T335491
[08:29:48] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3051.esams.wmnet
[08:29:49] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3050.esams.wmnet
[08:30:10] (PS1) Jbond: idp_test: add juniper information [puppet] - https://gerrit.wikimedia.org/r/931065
[08:30:27] !log rebooting cp3050 and cp3051 for kernel upgrade (T335835)
[08:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:00] (CR) Jbond: [C: +2] idp_test: add juniper information [puppet] - https://gerrit.wikimedia.org/r/931065 (owner: Jbond)
[08:31:32] (PS3) Jbond: ferm: Allow passing the port is a more structured way (WIP) [puppet] - https://gerrit.wikimedia.org/r/930656 (owner: Muehlenhoff)
[08:33:07] (ProbeDown) firing: (3) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:33:26] claime: I think egress for people was added in e03159c1555570ab4d42aa933df8279b0d4d5087 - anything changed after that? [08:33:40] kart_: checking [08:33:59] (03PS4) 10ArielGlenn: Fix up more things in the README for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/928605 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [08:34:07] (ProbeDown) firing: (4) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:34:13] helmfile.d/services/machinetranslation/values.yaml - should have it. [08:34:13] kart_: Yeah, people servers changed [08:34:19] :/ [08:34:23] templates/wmnet:people 5M IN CNAME people1004.eqiad.wmnet. [08:34:25] templates/wmnet:people 5M IN CNAME people2003.codfw.wmnet. 
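Editor's sketch, not part of the channel log: the two `templates/wmnet:people ...` lines pasted above look like grep output over DNS zone templates, in the shape `file:name TTL IN CNAME target.`. Assuming that layout, the short Python helper below pulls out the CNAME targets, which are the hosts an egress rule for `people` would have to allow. The names `CnameRecord` and `parse_cname_lines` are hypothetical, and resolving the targets to IP addresses is deliberately left out.

```python
import re
from typing import NamedTuple


class CnameRecord(NamedTuple):
    """One CNAME line from grep-style output (hypothetical helper type)."""
    source_file: str
    name: str
    ttl: str
    target: str


# Assumed input shape: "templates/wmnet:people 5M IN CNAME people1004.eqiad.wmnet."
_CNAME_RE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<name>\S+)\s+(?P<ttl>\S+)\s+IN\s+CNAME\s+(?P<target>\S+?)\.?$"
)


def parse_cname_lines(lines):
    """Return a CnameRecord for every line that matches the assumed shape.

    Lines that do not match are skipped. The trailing dot of the
    fully qualified target name is dropped.
    """
    records = []
    for line in lines:
        match = _CNAME_RE.match(line.strip())
        if match:
            records.append(
                CnameRecord(match["file"], match["name"], match["ttl"], match["target"])
            )
    return records


if __name__ == "__main__":
    pasted = [
        "templates/wmnet:people 5M IN CNAME people1004.eqiad.wmnet.",
        "templates/wmnet:people 5M IN CNAME people2003.codfw.wmnet.",
    ]
    for record in parse_cname_lines(pasted):
        print(record.name, "->", record.target)
```

Comparing the parsed targets against the hosts an existing egress rule was written for is one way to spot that a service alias has moved to new servers.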
[08:34:52] (03Merged) 10jenkins-bot: moveToExternal: First decompress gziped entries before iconv [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930925 (https://phabricator.wikimedia.org/T128150) (owner: 10Ladsgroup) [08:35:04] (03CR) 10ArielGlenn: [C: 03+2] Fix up more things in the README for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/928605 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [08:35:24] kart_: Changed in Ibc56d2f36f6ba3a7b45c9ae955eb34673b34234f [08:36:07] (03CR) 10Jbond: [C: 03+1] Provided a dedicated KDC logrotate config and fix service reload [puppet] - 10https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [08:36:18] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:930925|moveToExternal: First decompress gziped entries before iconv (T128150)]] [08:36:22] T128150: Stop needing to use wgLegacyEncoding in Wikimedia cluster production - https://phabricator.wikimedia.org/T128150 [08:36:26] (03PS1) 10Ladsgroup: blocked domains: Make sure users can't bypass the list by using uppercase [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931066 (https://phabricator.wikimedia.org/T337431) [08:36:32] (03CR) 10Ladsgroup: [C: 03+2] blocked domains: Make sure users can't bypass the list by using uppercase [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931066 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [08:37:25] kart_: Since the egress existed already, and will be deprecated, I'll update the ips in your service [08:37:39] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:930925|moveToExternal: First decompress gziped entries before iconv (T128150)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [08:37:43] (03PS6) 10ArielGlenn: Modify the global blocks script to accept 
output dir [puppet] - https://gerrit.wikimedia.org/r/928861 (owner: Hokwelum)
[08:38:04] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3051.esams.wmnet
[08:38:07] (ProbeDown) firing: (4) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:38:20] claime: kart_: is your work related to this alert ^^^
[08:38:35] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3050.esams.wmnet
[08:38:42] jbond: To? text:80 ? Can't see how
[08:39:00] ack thanks
[08:41:20] (PS8) Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480)
[08:41:42] jbond: tell me if I need to drop it and help on the page
[08:42:31] claime: looks like it's calming down; seems like we had a blip in esams
[08:42:34] https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1&from=now-5m&to=now
[08:42:42] ack
[08:43:07] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:43:15] don't know if it's related but I'm depooling/rebooting/repooling hosts in esams
[08:43:33] for T335835
[08:43:38] hmm, spoke too soon
[08:43:42] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41771/console" [puppet] - https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[08:43:50] fabfur: ahh yes that
could be related [08:44:07] (ProbeDown) firing: (4) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:44:35] let me know if I should stop the cookbook [08:44:47] fabfur: does the start of the reboots match with the network probes failing? [08:44:52] claime: How can I get IPs of updated people servers, so I can fix issue? Never done that work :) [08:45:10] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:930925|moveToExternal: First decompress gziped entries before iconv (T128150)]] (duration: 08m 52s) [08:45:14] T128150: Stop needing to use wgLegacyEncoding in Wikimedia cluster production - https://phabricator.wikimedia.org/T128150 [08:45:26] reboot start procedure started at 08.29 UTC [08:45:29] (03PS1) 10Clément Goubert: machinetranslation: Update people egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/931086 (https://phabricator.wikimedia.org/T335491) [08:45:31] kart_: I've done the CR, but for future reference, netbox [08:45:41] fabfur: that is about right can you stop [08:45:44] claime: Noted. [08:45:53] jbond: sure [08:46:03] fabfur: ack seems matching, IIUC you ran the reboot single right? 
[08:46:05] We might need to increase the maxconns for port 80 in haproxy [08:46:09] yes [08:46:12] ack [08:46:22] super, let's see if it recovers after stopping the reboots [08:46:27] in theory it should [08:46:38] (03PS1) 10Ladsgroup: Temporarily bring back legacy encoding in four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931087 (https://phabricator.wikimedia.org/T128150) [08:46:50] fabfur: cheers [08:47:08] (03PS2) 10Ladsgroup: Temporarily bring back legacy encoding in four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931087 (https://phabricator.wikimedia.org/T128150) [08:47:46] the 2 hosts have been repooled at 08.38 UTC [08:48:07] (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:48:24] ack seems related [08:49:07] (ProbeDown) resolved: (4) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:11] 500's also gone down https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-30m&to=now&viewPanel=63 [08:49:21] (03PS7) 10ArielGlenn: Modify the global blocks dumps script to permit override of default output dir [puppet] - 10https://gerrit.wikimedia.org/r/928861 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [08:49:22] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1124.eqiad.wmnet with OS bookworm [08:49:49] (03CR) 10Clément Goubert: [C: 03+2] machinetranslation: Update people egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/931086 (https://phabricator.wikimedia.org/T335491) 
(owner: 10Clément Goubert) [08:50:04] Hmmm the 500s on the applayer aren't related at all with exams reboots [08:50:11] *esams [08:50:18] (03CR) 10Ladsgroup: [C: 03+2] Temporarily bring back legacy encoding in four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931087 (https://phabricator.wikimedia.org/T128150) (owner: 10Ladsgroup) [08:50:29] hmm I didn't look deeply but the timeline seemed to match [08:50:50] (03Merged) 10jenkins-bot: machinetranslation: Update people egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/931086 (https://phabricator.wikimedia.org/T335491) (owner: 10Clément Goubert) [08:51:13] (03Merged) 10jenkins-bot: Temporarily bring back legacy encoding in four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931087 (https://phabricator.wikimedia.org/T128150) (owner: 10Ladsgroup) [08:51:21] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [08:53:42] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:931087|Temporarily bring back legacy encoding in four wikis (T128150)]] [08:53:45] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [08:53:46] T128150: Stop needing to use wgLegacyEncoding in Wikimedia cluster production - https://phabricator.wikimedia.org/T128150 [08:55:12] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:931087|Temporarily bring back legacy encoding in four wikis (T128150)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [08:55:53] (03Merged) 10jenkins-bot: blocked domains: Make sure users can't bypass the list by using uppercase [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931066 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [08:55:55] claime: staging data seems to be appearing. Thanks a lot! 
[08:56:34] (HelmReleaseBadStatus) resolved: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:56:42] jbond: a small spike could be related to connections getting closed abruptly. Take into account that port 80 never hits the applayer [08:57:41] (03CR) 10KartikMistry: "Thanks @claime!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/931086 (https://phabricator.wikimedia.org/T335491) (owner: 10Clément Goubert) [08:58:45] kart_: np :) [08:58:53] claime: Should I also go ahead with eqiad/codfw deployment or should we wait for some time? [08:59:12] kart_: No, I think you're fine [08:59:22] cool. [08:59:39] vgutierrez: ack thanks [09:00:43] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [09:01:13] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:931087|Temporarily bring back legacy encoding in four wikis (T128150)]] (duration: 07m 31s) [09:01:17] T128150: Stop needing to use wgLegacyEncoding in Wikimedia cluster production - https://phabricator.wikimedia.org/T128150 [09:01:29] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) Seeing many errors like this: ` Jun 19 09:00:07 cloudservices2004-dev pdns_server[1181224]: Received NOTIFY for codfw1... 
[09:02:04] (03PS1) 10Ladsgroup: Blocked domains: Fix removing a domain via the special page [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931067 (https://phabricator.wikimedia.org/T337431) [09:02:11] (03CR) 10Ladsgroup: [C: 03+2] Blocked domains: Fix removing a domain via the special page [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931067 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [09:02:19] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:931066|blocked domains: Make sure users can't bypass the list by using uppercase (T337431)]] [09:02:23] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [09:03:41] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:931066|blocked domains: Make sure users can't bypass the list by using uppercase (T337431)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [09:06:04] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [09:07:52] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [09:07:58] (03PS8) 10ArielGlenn: Modify the global blocks dumps script to permit override of default output dir [puppet] - 10https://gerrit.wikimedia.org/r/928861 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [09:09:41] (03CR) 10Jbond: ferm: Allow passing the port is a more structured way (WIP) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/930656 (owner: 10Muehlenhoff) [09:10:00] (03CR) 10ArielGlenn: "This patch has been tested on snapshot1009 (our testbed) and works as advertised." 
[puppet] - 10https://gerrit.wikimedia.org/r/928861 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [09:10:17] (03CR) 10ArielGlenn: [C: 03+2] Modify the global blocks dumps script to permit override of default output dir [puppet] - 10https://gerrit.wikimedia.org/r/928861 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [09:11:17] someone's netbox changes are up to be merged, I skipped them and just did my ones [09:11:33] (via puppet-merge) [09:12:12] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:931066|blocked domains: Make sure users can't bypass the list by using uppercase (T337431)]] (duration: 09m 53s) [09:12:16] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [09:12:41] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [09:13:32] (03PS2) 10ArielGlenn: make snapshot101[67] temporary testbed hosts [puppet] - 10https://gerrit.wikimedia.org/r/930671 (owner: 10Hokwelum) [09:14:33] (03PS9) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [09:15:03] !log Updated MinT to 2023-06-16-042302-production, Updated people egress (T339271, T335491) [09:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:08] T335491: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 [09:15:08] T339271: MinT translates to Hindi when English-Santali is selected - https://phabricator.wikimedia.org/T339271 [09:15:21] (03PS2) 10Jbond: admin: Add new ed25519 key for akosiaris [puppet] - 10https://gerrit.wikimedia.org/r/931055 (https://phabricator.wikimedia.org/T336769) (owner: 10Alexandros Kosiaris) [09:15:55] (03CR) 10Kamila Součková: [C: 03+1] trafficserver: Send testwiki traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/930547 (https://phabricator.wikimedia.org/T337489) (owner: 10Clément Goubert) [09:18:22] (03PS3) 10ArielGlenn: make snapshot101[67] temporary testbed hosts [puppet] - 10https://gerrit.wikimedia.org/r/930671 (owner: 10Hokwelum) [09:21:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931067 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [09:21:08] (03CR) 10ArielGlenn: [C: 03+2] make snapshot101[67] temporary testbed hosts [puppet] - 10https://gerrit.wikimedia.org/r/930671 (owner: 10Hokwelum) [09:21:09] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1058.eqiad.wmnet [09:21:16] (03Merged) 10jenkins-bot: Blocked domains: Fix removing a domain via the special page [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931067 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [09:21:36] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:931067|Blocked domains: Fix removing a domain via the special 
page (T337431)]] [09:21:39] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [09:22:57] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:931067|Blocked domains: Fix removing a domain via the special page (T337431)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [09:24:37] (03PS1) 10Jbond: config/common: add new key for jbond [homer/public] - 10https://gerrit.wikimedia.org/r/931229 (https://phabricator.wikimedia.org/T336769) [09:27:42] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [09:28:46] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/929737 (https://phabricator.wikimedia.org/T337972) (owner: 10Jbond) [09:29:36] (03PS1) 10Ladsgroup: Enable new spam block page in all wikis except meta, commons, wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931231 (https://phabricator.wikimedia.org/T337431) [09:30:01] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:931067|Blocked domains: Fix removing a domain via the special page (T337431)]] (duration: 08m 24s) [09:30:04] (03CR) 10Gmodena: [C: 03+2] mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [09:30:05] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [09:30:14] jouncebot: nowandnext [09:30:14] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [09:30:14] In 0 hour(s) and 29 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1000) [09:30:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928665 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [09:30:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on 
k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:30:43] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1058.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [09:31:00] (03CR) 10Ladsgroup: [C: 03+2] Enable new spam block page in all wikis except meta, commons, wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931231 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [09:31:16] (03Merged) 10jenkins-bot: mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [09:32:07] (03Merged) 10jenkins-bot: Enable new spam block page in all wikis except meta, commons, wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931231 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [09:32:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931231 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [09:32:36] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) >>! In T338778#8945972, @aborrero wrote: > Seeing many errors like this: > > ` > Jun 19 09:00:07 cloudservices2004-dev... 
[09:32:38] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:931231|Enable new spam block page in all wikis except meta, commons, wikidata (T337431)]] [09:33:58] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:931231|Enable new spam block page in all wikis except meta, commons, wikidata (T337431)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [09:34:11] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [09:34:16] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:35:26] (03PS10) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [09:35:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:35:39] (03PS1) 10Muehlenhoff: d-i: Fix retrieval of reuse-parts.sh for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/931232 (https://phabricator.wikimedia.org/T339835) [09:36:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] trafficserver: Send testwiki traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/930547 (https://phabricator.wikimedia.org/T337489) (owner: 10Clément Goubert) [09:37:24] (03PS1) 10ArielGlenn: add snapshot10014-7 to the scap targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/931233 [09:37:43] (03CR) 10Marostegui: [C: 03+1] "haha tricky one" [puppet] - 10https://gerrit.wikimedia.org/r/931232 (https://phabricator.wikimedia.org/T339835) (owner: 10Muehlenhoff) [09:38:22] (03CR) 10Jbond: [C: 03+2] "validated via gchat" [puppet] - 10https://gerrit.wikimedia.org/r/931055 
(https://phabricator.wikimedia.org/T336769) (owner: 10Alexandros Kosiaris) [09:39:00] (03CR) 10Jbond: [C: 03+2] "validated via gchat" [homer/public] - 10https://gerrit.wikimedia.org/r/931056 (https://phabricator.wikimedia.org/T336769) (owner: 10Alexandros Kosiaris) [09:39:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41776/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:40:32] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1058.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [09:40:32] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:40:32] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1058.eqiad.wmnet [09:41:17] (03PS11) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [09:43:05] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41777/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:43:23] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:931231|Enable new spam block page in all wikis except meta, commons, wikidata (T337431)]] (duration: 10m 45s) [09:43:27] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [09:43:32] (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: service: prevent chicken-egg problems with puppetmaster FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931235 [09:43:43] (03CR) 10Muehlenhoff: [C: 03+2] d-i: Fix retrieval of reuse-parts.sh for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/931232 (https://phabricator.wikimedia.org/T339835) (owner: 10Muehlenhoff) [09:43:59] (03CR) 10CI reject: [V: 04-1] openstack: designate: service: prevent chicken-egg problems with puppetmaster FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931235 (owner: 10Arturo Borrero Gonzalez) [09:44:35] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] add snapshot10014-7 to the scap targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/931233 (owner: 10ArielGlenn) [09:45:09] (03PS12) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [09:47:21] (03PS2) 10Arturo Borrero Gonzalez: openstack: designate: prevent chicken-egg problems with puppetmaster FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931235 [09:47:46] (03CR) 10CI reject: [V: 04-1] openstack: designate: prevent chicken-egg problems with puppetmaster FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931235 (owner: 10Arturo Borrero Gonzalez) [09:47:54] PROBLEM - mediawiki-installation DSH group on snapshot1016 is CRITICAL: Host snapshot1016 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:48:26] (03CR) 10Slyngshede: "This patch is "step one" as simply betting that the script works can be slightly scary. Initially simply copy the new script and generate " [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:49:27] (03PS3) 10Arturo Borrero Gonzalez: openstack: designate: prevent chicken-egg problems with puppetmaster FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931235 [09:51:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1124.eqiad.wmnet with OS bookworm [09:59:36] (03PS1) 10Jbond: wmflib: Add new function to convert from a netmask to cidr [puppet] - 10https://gerrit.wikimedia.org/r/931236 (https://phabricator.wikimedia.org/T336864) [09:59:38] (03PS1) 10Jbond: interface::alias: update define to get prefix len from netmask [puppet] - 10https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1000) [10:00:05] claime: A patch you scheduled for MediaWiki infrastucture (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[10:00:15] (03CR) 10CI reject: [V: 04-1] wmflib: Add new function to convert from a netmask to cidr [puppet] - 10https://gerrit.wikimedia.org/r/931236 (https://phabricator.wikimedia.org/T336864) (owner: 10Jbond) [10:00:21] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: Send testwiki traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/930547 (https://phabricator.wikimedia.org/T337489) (owner: 10Clément Goubert) [10:00:43] !log Switching test.wikipedia.org to mw-on-k8s - T337489 [10:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:48] T337489: Run QTE test suite on testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 [10:01:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1124.eqiad.wmnet with reason: host reimage [10:04:02] (03PS2) 10Jbond: wmflib: Add new function to convert from a netmask to cidr [puppet] - 10https://gerrit.wikimedia.org/r/931236 (https://phabricator.wikimedia.org/T336864) [10:04:04] (03PS2) 10Jbond: interface::alias: update define to get prefix len from netmask [puppet] - 10https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) [10:04:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1124.eqiad.wmnet with reason: host reimage [10:06:45] 10SRE-swift-storage, 10serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10MatthewVernon) I'm afraid not (unless there's a thanos setup in beta); you could spin one up in a pontoon stack, but that might be more work than you wanted! 
[10:07:13] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10hnowlan) This has subsided as a result of T337649#8938960 - however this behaviour is a side effect of the work required in T338297 [10:09:56] (03CR) 10Volans: "post-merge question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/930165 (https://phabricator.wikimedia.org/T339121) (owner: 10Jbond) [10:11:47] (03CR) 10Jbond: "pcc https://puppet-compiler.wmflabs.org/output/931237/41778/" [puppet] - 10https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) (owner: 10Jbond) [10:14:11] PROBLEM - mediawiki-installation DSH group on snapshot1017 is CRITICAL: Host snapshot1017 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:15:36] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [10:15:40] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:15:57] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: also skip failed hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/930165 (https://phabricator.wikimedia.org/T339121) (owner: 10Jbond) [10:16:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P49445 and previous config saved to /var/cache/conftool/dbconfig/20230619-101623-ladsgroup.json [10:16:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 10%: Maint over (T338354)', diff saved to https://phabricator.wikimedia.org/P49446 and previous config saved to /var/cache/conftool/dbconfig/20230619-101653-ladsgroup.json [10:16:57] T338354: db1135 has crashed - https://phabricator.wikimedia.org/T338354 [10:16:58] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [10:17:01] !log 
gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:17:21] 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Ladsgroup) The data check didn't bring any difference. Repooling [10:18:46] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [10:19:30] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990 (10Volans) The problem is that we can't use the exit code of the NOOP run because it can be both ok and not ok with the same non-zero exit... [10:30:00] (03PS4) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [10:30:17] (03PS2) 10Jbond: sre.__init__: add min_grace_sleep class param [cookbooks] - 10https://gerrit.wikimedia.org/r/930205 [10:30:24] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:30:43] (03CR) 10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/930205 (owner: 10Jbond) [10:30:48] (03PS15) 10Jbond: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [10:31:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: Maint over (T338354)', diff saved to https://phabricator.wikimedia.org/P49447 and previous config saved to /var/cache/conftool/dbconfig/20230619-103157-ladsgroup.json [10:32:02] T338354: db1135 has crashed - https://phabricator.wikimedia.org/T338354 [10:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:16] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (10Volans) That was changed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/924342 without modifying spicerack although it's written o... [10:35:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931235 (owner: 10Arturo Borrero Gonzalez) [10:35:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: designate: prevent chicken-egg problems with puppetmaster FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931235 (owner: 10Arturo Borrero Gonzalez) [10:36:05] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/930644 (https://phabricator.wikimedia.org/T339243) (owner: 10Clément Goubert) [10:39:47] (03CR) 10Clément Goubert: [C: 03+2] service: Make lvs[monitors] optional in ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/930644 (https://phabricator.wikimedia.org/T339243) (owner: 10Clément Goubert) [10:40:49] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990 (10jbond) > The problem is that we can't use the exit code of the NOOP run because it can be both ok and not ok with the same non-zero exit... [10:41:59] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (10Clement_Goubert) I figured. I merged the change, tell me when you cut a release and we can resolve. 
[10:43:21] (03CR) 10Ilias Sarantopoulos: changeprop: set wiki_id match config for outlink stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [10:43:27] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:43:35] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:43:51] (ProbeDown) firing: (2) Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#idm2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:44:09] (03Merged) 10jenkins-bot: service: Make lvs[monitors] optional in ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/930644 (https://phabricator.wikimedia.org/T339243) (owner: 10Clément Goubert) [10:47:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: Maint over (T338354)', diff saved to https://phabricator.wikimedia.org/P49448 and previous config saved to /var/cache/conftool/dbconfig/20230619-104702-ladsgroup.json [10:47:07] T338354: db1135 has crashed - https://phabricator.wikimedia.org/T338354 [10:47:29] (03CR) 10Jbond: ferm: Allow passing the port is a more structured way (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:48:27] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1059.eqiad.wmnet [10:49:41] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:49:49] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 
3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:50:38] (03CR) 10Abijeet Patro: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/928159 (https://phabricator.wikimedia.org/T323192) (owner: 10Abijeet Patro) [10:52:53] !log imported megacli and ssacli to thirdparty/hwraid for bookworm-wikimedia T339847 [10:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:57] T339847: megacli missing on bookworm - https://phabricator.wikimedia.org/T339847 [10:54:57] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: No response from remote host 185.15.58.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:55:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1124.eqiad.wmnet with OS bookworm [10:56:02] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [10:56:23] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:58:12] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1059.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [10:58:52] (03PS1) 10Arturo Borrero Gonzalez: dnsrecursor: introduce query-local-address parameter [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) [10:59:21] (03PS1) 10Ladsgroup: Set externallinks migration to read new everywhere except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931240 (https://phabricator.wikimedia.org/T335343) [10:59:39] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: analytics1059.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [10:59:39] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:59:39] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1059.eqiad.wmnet [10:59:56] (03CR) 10Jbond: [V: 03+1 C: 04-1] "-1: this doesn't help as the client sends no SNI. SNI needs to be disabled in envoy see https://gerrit.wikimedia.org/r/c/operations/puppe" [puppet] - 10https://gerrit.wikimedia.org/r/930185 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [11:00:03] (03Abandoned) 10Jbond: promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930185 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [11:01:27] (03PS1) 10Majavah: Ensure service catalog schema matches spicerack release [puppet] - 10https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) [11:01:29] (03PS2) 10Ladsgroup: Set externallinks migration to read new everywhere except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931240 (https://phabricator.wikimedia.org/T335343) [11:02:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 100%: Maint over (T338354)', diff saved to https://phabricator.wikimedia.org/P49449 and previous config saved to /var/cache/conftool/dbconfig/20230619-110207-ladsgroup.json [11:02:11] T338354: db1135 has crashed - https://phabricator.wikimedia.org/T338354 [11:02:39] (03PS1) 10Muehlenhoff: aptrepo: Rename repo sync config used for megacli and apply to bookworm as well [puppet] - 10https://gerrit.wikimedia.org/r/931242 (https://phabricator.wikimedia.org/T339847) [11:03:49] (03CR) 10Marostegui: [C: 03+1] aptrepo: Rename repo sync config used for megacli and apply to bookworm as well [puppet] - 10https://gerrit.wikimedia.org/r/931242 (https://phabricator.wikimedia.org/T339847) 
(owner: 10Muehlenhoff) [11:05:04] (03CR) 10CI reject: [V: 04-1] Ensure service catalog schema matches spicerack release [puppet] - 10https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) (owner: 10Majavah) [11:05:58] (03PS2) 10Majavah: Ensure service catalog schema matches spicerack release [puppet] - 10https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) [11:09:35] (03CR) 10CI reject: [V: 04-1] Ensure service catalog schema matches spicerack release [puppet] - 10https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) (owner: 10Majavah) [11:10:10] (03CR) 10Clément Goubert: "This needs to be merged after the spicerack release fixing ServiceLVS is cut, else it will break." [puppet] - 10https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) (owner: 10Majavah) [11:12:29] (03PS11) 10Jbond: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [11:13:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41780/console" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [11:14:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41781/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [11:18:00] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930833 [11:21:00] (03PS12) 10Jbond: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [11:22:35] (03CR) 10Clément Goubert: "According 
to the jenkins logs, the test environment installs wikimedia-spicerack==5.0.2 which raises" [puppet] - 10https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) (owner: 10Majavah) [11:26:19] (03PS13) 10Jbond: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [11:26:21] (03PS1) 10Jbond: releases: switch releases to use git~::clone checkout method [puppet] - 10https://gerrit.wikimedia.org/r/931256 (https://phabricator.wikimedia.org/T290260) [11:28:34] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1060.eqiad.wmnet [11:28:45] (03CR) 10Majavah: [C: 04-1] "Is there a specific reason to do this for `ensure => latest` only? I'd still expect Puppet to manage the repository settings when `present" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [11:29:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41785/console" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [11:29:48] 10SRE, 10Content-Transform-Team-WIP, 10RESTBase, 10RESTbase Sunsetting, and 6 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (10MSantos) [11:32:01] (03PS2) 10Arturo Borrero Gonzalez: dnsrecursor: introduce query-local-address parameter [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) [11:32:17] (03PS7) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) [11:34:09] (03PS1) 10Muehlenhoff: Update sync definition for HP raid modules for bookworm [puppet] - 
10https://gerrit.wikimedia.org/r/931258 (https://phabricator.wikimedia.org/T339847) [11:34:27] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Rename repo sync config used for megacli and apply to bookworm as well [puppet] - 10https://gerrit.wikimedia.org/r/931242 (https://phabricator.wikimedia.org/T339847) (owner: 10Muehlenhoff) [11:36:01] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [11:36:37] (03PS11) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [11:36:44] (03CR) 10Marostegui: Update sync definition for HP raid modules for bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931258 (https://phabricator.wikimedia.org/T339847) (owner: 10Muehlenhoff) [11:37:48] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/931239/41786/" [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:38:24] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1060.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [11:39:55] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1060.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [11:39:55] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:39:56] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1060.eqiad.wmnet [11:41:13] (03CR) 10Muehlenhoff: Update sync definition for HP raid modules for bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931258 (https://phabricator.wikimedia.org/T339847) 
(owner: 10Muehlenhoff) [11:43:21] (03PS2) 10Muehlenhoff: Update sync definition for HP raid modules for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/931258 (https://phabricator.wikimedia.org/T339847) [11:46:02] (03CR) 10Marostegui: [C: 03+1] Update sync definition for HP raid modules for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/931258 (https://phabricator.wikimedia.org/T339847) (owner: 10Muehlenhoff) [11:48:08] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/931239/41789/" [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:48:16] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:48:28] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert) 05Stalled→03In progress [11:49:46] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:53:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:53:50] (03PS1) 10Jelto: gitlab: remove gitlab_default_can_create_group setting [puppet] - 10https://gerrit.wikimedia.org/r/931259 (https://phabricator.wikimedia.org/T338460) [11:53:51] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:53:56] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:54:36] !log gmodena@deploy1002 helmfile 
[staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:54:40] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:55:18] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:55:20] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:55:49] (03CR) 10Muehlenhoff: [C: 03+2] Update sync definition for HP raid modules for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/931258 (https://phabricator.wikimedia.org/T339847) (owner: 10Muehlenhoff) [11:56:07] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41790/console" [puppet] - 10https://gerrit.wikimedia.org/r/931259 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto) [11:58:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:59:53] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10cmooney) >>! In T338778#8946041, @aborrero wrote: > Fixed by running this in the pdns database; > > ` > update domains set mast... 
[12:00:31] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:00:35] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:01:12] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:01:16] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:04:19] (PS1) KartikMistry: Enable Content and Section Translation for a 3rd group of 10 languages previously lacking MT [mediawiki-config] - https://gerrit.wikimedia.org/r/931260 (https://phabricator.wikimedia.org/T337834)
[12:17:14] (PS14) Jbond: git::clone: Handle changes to origin URL and/or branch [puppet] - https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: Ahmon Dancy)
[12:17:16] (PS1) Jbond: git: update spec test [puppet] - https://gerrit.wikimedia.org/r/931261
[12:17:16] SRE, Infrastructure-Foundations, netops: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 (cmooney) p:Triage→Medium
[12:17:28] (PS1) Cathal Mooney: Update border-in firewall filter to set DSCP bits to DE [homer/public] - https://gerrit.wikimedia.org/r/931262 (https://phabricator.wikimedia.org/T339850)
[12:20:47] SRE, DC-Ops, Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) Open→Resolved We're using this controller for quite a while now, closing the task.
[12:21:00] !log uploaded wmfmariadbpy 0.10+deb12u1 T339835 [12:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:04] T339835: Install Debian Bookworm on a DB - https://phabricator.wikimedia.org/T339835 [12:22:48] (03PS15) 10Jbond: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [12:23:46] (03PS1) 10Cathal Mooney: Add ferm rule to mark all server traffic as DSCP 0 [puppet] - 10https://gerrit.wikimedia.org/r/931263 (https://phabricator.wikimedia.org/T339850) [12:23:56] (03PS3) 10Arturo Borrero Gonzalez: dnsrecursor: introduce query-local-address parameter [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) [12:24:02] (03CR) 10Jbond: git::clone: Handle changes to origin URL and/or branch (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [12:24:44] (03CR) 10Jbond: [C: 03+2] git: update spec test [puppet] - 10https://gerrit.wikimedia.org/r/931261 (owner: 10Jbond) [12:26:01] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "This PCC is better because is a NOOP for all servers except cloud ones:" [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [12:29:53] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [12:30:16] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] dnsrecursor: introduce query-local-address parameter [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [12:34:29] (03PS1) 10Elukey: profile::pki::root_ca: add intermediate for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/931264 
(https://phabricator.wikimedia.org/T288470) [12:38:09] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [12:39:13] (03CR) 10Jbond: Ensure service catalog schema matches spicerack release (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) (owner: 10Majavah) [12:39:33] (03PS16) 10Jbond: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [12:40:48] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) >>! In T338778#8946438, @cmooney wrote: >>>! In T338778#8946041, @aborrero wrote: >> Fixed by running this in the pdns... [12:43:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (as long as ml-cache gets updated to 4.1.1 before merging :-)" [puppet] - 10https://gerrit.wikimedia.org/r/930738 (https://phabricator.wikimedia.org/T310980) (owner: 10Eevans) [12:50:18] (03CR) 10Jbond: "couple of fly by comments" [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [12:51:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931264 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:54:03] (03CR) 10Elukey: [C: 03+2] profile::pki::root_ca: add intermediate for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/931264 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:54:06] (03CR) 10EoghanGaffney: [C: 03+1] gitlab: remove gitlab_default_can_create_group setting [puppet] - 10https://gerrit.wikimedia.org/r/931259 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto) [12:57:52] 10SRE, 10Infrastructure-Foundations, 10netops: Configure 
ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (10cmooney) p:05Triage→03Medium [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1300) [13:00:05] albertoleoncio and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] Hi! [13:00:22] I’m here but https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar says “no deploys Mon 19 Jun”? [13:00:32] (but everyone else seems to be deploying like normal…) [13:00:54] (03PS1) 10Elukey: profile::pkie::intermediates: add the cassandra public certificate [puppet] - 10https://gerrit.wikimedia.org/r/931267 (https://phabricator.wikimedia.org/T288470) [13:02:02] (03PS1) 10Ladsgroup: Revert "db1135: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/931072 [13:02:25] (03PS2) 10Ladsgroup: Revert "db1135: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/931072 [13:02:25] Idk... ¯_(ツ)_/¯ [13:02:31] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1135: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/931072 (owner: 10Ladsgroup) [13:02:38] hi [13:03:03] oh, oops. it's always the holidays, ugh [13:03:50] (03PS2) 10Jbond: profile::pki::intermediates: add the cassandra public certificate [puppet] - 10https://gerrit.wikimedia.org/r/931267 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [13:03:53] to me both of the changes don’t look very urgent, so I would postpone them both [13:04:02] (03PS1) 10Ladsgroup: file: Make pre-gen rendering of multi-page files (pdf, ...) 
serial [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931073 (https://phabricator.wikimedia.org/T337649) [13:04:04] (03PS3) 10Elukey: profile::pki::intermediates: add the cassandra public certificate [puppet] - 10https://gerrit.wikimedia.org/r/931267 (https://phabricator.wikimedia.org/T288470) [13:04:09] (03CR) 10Jbond: [C: 03+1] "LGTM (i just fixed the typo in the commit)" [puppet] - 10https://gerrit.wikimedia.org/r/931267 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [13:04:11] i can move my stuff to tomorrow [13:04:25] jbond: ahahah you were quicker than me :) [13:04:54] my understanding was that we don't have a train this week but backports should be fine [13:05:09] it's only US holiday [13:05:17] (03PS1) 10Elukey: pki: add fake cassandra intermediate key [labs/private] - 10https://gerrit.wikimedia.org/r/931269 [13:05:17] MatmaRex: btw, dewiki is at 1223500 out of 7510732 rows, frwiki 1437600 out of 11904854 [13:05:26] (03CR) 10Elukey: [V: 03+2 C: 03+2] pki: add fake cassandra intermediate key [labs/private] - 10https://gerrit.wikimedia.org/r/931269 (owner: 10Elukey) [13:05:32] Lucas_WMDE: nice, thanks [13:05:51] group 2 is at nlwiki and group3 at hiwiki [13:06:58] the last time someone designated a "no deploys" day for a US holiday and then didn't update the calendar so no one knew about it, they were quite upset about me scheduling things :/ [13:07:45] i think it was the week of thanksgiving (not even the thursday) [13:09:56] albertoleoncio: should i move yours to tomorrow too? [13:10:27] (03CR) 10Elukey: [C: 03+2] profile::pki::intermediates: add the cassandra public certificate [puppet] - 10https://gerrit.wikimedia.org/r/931267 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [13:10:49] (03PS2) 10Kamila Součková: add discovery records for rest-gateway and device-analytics. 
[dns] - 10https://gerrit.wikimedia.org/r/930631 (https://phabricator.wikimedia.org/T335505) (owner: 10Hnowlan) [13:11:08] (03CR) 10Kamila Součková: [C: 03+1] add discovery records for rest-gateway and device-analytics. [dns] - 10https://gerrit.wikimedia.org/r/930631 (https://phabricator.wikimedia.org/T335505) (owner: 10Hnowlan) [13:11:33] Well... I really like to deploy this today =/ [13:12:12] But if it is not possible, that's fine [13:12:47] (03CR) 10Kamila Součková: [C: 03+2] add discovery records for rest-gateway and device-analytics. [dns] - 10https://gerrit.wikimedia.org/r/930631 (https://phabricator.wikimedia.org/T335505) (owner: 10Hnowlan) [13:12:59] Lucas_WMDE: do you want to do albertoleoncio's config patch then? [13:13:13] i'll move my thing and go complain somewhere [13:14:13] PROBLEM - puppet last run on kafka-test1006 is CRITICAL: CRITICAL: Puppet has been disabled for 605053 seconds, message: Elukey - elukey, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:14:32] !log installing openjdk-17 security updates [13:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:46] MatmaRex: I was just about to suggest that the yearly calendar should be machine-readable, and the calendar creation scripts should use it [13:15:49] (03PS1) 10DCausse: token_count_router: infer the analyzer from the field (followup) [extensions/CirrusSearch] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931271 (https://phabricator.wikimedia.org/T339810) [13:17:02] MatmaRex, albertoleoncio: I’m not deploying anything today unless it’s super urgent [13:17:08] (03PS1) 10Elukey: role::pki::multirootca: add the cassandra intermediate [puppet] - 10https://gerrit.wikimedia.org/r/931272 (https://phabricator.wikimedia.org/T288470) [13:17:13] fair [13:17:21] taavi: sounds like a good idea [13:17:39] fair tmt [13:18:36] (03CR) 10Jbond: [C: 03+1] role::pki::multirootca: add the cassandra 
intermediate [puppet] - 10https://gerrit.wikimedia.org/r/931272 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [13:18:57] (03CR) 10Elukey: [C: 03+2] role::pki::multirootca: add the cassandra intermediate [puppet] - 10https://gerrit.wikimedia.org/r/931272 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [13:19:47] RECOVERY - puppet last run on kafka-test1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:20:05] (03CR) 10Ssingh: "Thanks for ensuring NOOP on prod DNS hosts! One important think that we should fix sooner than later is mentioned in-line:" [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [13:25:18] (03CR) 10Stevemunene: [C: 03+2] analytics: Decommission analytics106[1-3] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930580 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [13:26:37] (03PS5) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [13:27:01] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:31:44] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:31:47] (03PS6) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [13:32:27] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way 
[puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:33:24] PROBLEM - Check systemd state on analytics1061 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:30] PROBLEM - Check systemd state on analytics1062 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:18] PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:34:33] (03PS7) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [13:34:34] PROBLEM - Check systemd state on analytics1063 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:56] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:37:19] (03PS1) 10Jbond: puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T330490) [13:39:05] (03PS8) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [13:39:30] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) 
(owner: 10Muehlenhoff) [13:39:32] (03PS2) 10Btullis: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [13:39:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41793/console" [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:40:22] (03CR) 10Elukey: [C: 03+2] role::cache::{text,upload}: move vk instances to PKI in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/930633 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [13:44:17] !log updated DNS: added discovery records for rest-gateway and device-analytics T335505 [13:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:24] T335505: Figure out what's outstanding to have device-analytics serving 100% Production data - https://phabricator.wikimedia.org/T335505 [13:47:39] (03PS9) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [13:47:56] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 46 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:48:19] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:48:59] (03PS1) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [13:50:14] (03PS10) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 
10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [13:50:38] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:51:50] (03PS3) 10Btullis: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [13:52:46] (03PS1) 10Papaul: Add ed25519 key for Papaul [homer/public] - 10https://gerrit.wikimedia.org/r/931277 (https://phabricator.wikimedia.org/T336769) [13:53:19] (03PS11) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [13:53:20] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:53:46] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:55:56] (03PS1) 10Elukey: role::ml_cache:storage: move internode settings to 'all' [puppet] - 10https://gerrit.wikimedia.org/r/931278 (https://phabricator.wikimedia.org/T339300) [13:56:32] jouncebot: nowandnext [13:56:32] For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1300) [13:56:32] In 1 hour(s) and 33 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1530) [13:57:58] (03CR) 10Ladsgroup: [C: 03+2] file: Make pre-gen rendering of multi-page files (pdf, ...) 
serial [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931073 (https://phabricator.wikimedia.org/T337649) (owner: 10Ladsgroup) [14:00:27] (03PS1) 10Raymond Ndibe: Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) [14:00:59] (03CR) 10CI reject: [V: 04-1] Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [14:01:17] (03CR) 10Cathal Mooney: [C: 03+1] Add ed25519 key for Papaul [homer/public] - 10https://gerrit.wikimedia.org/r/931277 (https://phabricator.wikimedia.org/T336769) (owner: 10Papaul) [14:02:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/929951 (owner: 10Slyngshede) [14:02:58] (03CR) 10Papaul: [C: 03+2] Add ed25519 key for Papaul [homer/public] - 10https://gerrit.wikimedia.org/r/931277 (https://phabricator.wikimedia.org/T336769) (owner: 10Papaul) [14:04:01] (03PS2) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [14:04:32] (03PS2) 10Raymond Ndibe: Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) [14:04:47] !log move varnishafka instances in eqsin to PKI [14:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:59] (03CR) 10CI reject: [V: 04-1] Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [14:05:11] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41795/console" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:05:15] (03PS1) 10FNegri: cumin: 
Increase connect_timeout for slow servers [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [14:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:31] (03PS3) 10Raymond Ndibe: Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) [14:12:02] (03CR) 10CI reject: [V: 04-1] Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [14:14:37] (03PS4) 10Raymond Ndibe: Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) [14:15:03] (03CR) 10CI reject: [V: 04-1] Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [14:17:17] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:17:34] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:17:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:47] (03Merged) 10jenkins-bot: file: Make pre-gen rendering of multi-page files (pdf, ...) 
serial [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931073 (https://phabricator.wikimedia.org/T337649) (owner: 10Ladsgroup) [14:18:54] (03PS5) 10Raymond Ndibe: Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) [14:19:20] (03CR) 10CI reject: [V: 04-1] Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [14:19:24] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:931073|file: Make pre-gen rendering of multi-page files (pdf, ...) serial (T337649)]] [14:19:28] T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 [14:19:42] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment switch MariaDB driver. [puppet] - 10https://gerrit.wikimedia.org/r/929951 (owner: 10Slyngshede) [14:20:48] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:931073|file: Make pre-gen rendering of multi-page files (pdf, ...) serial (T337649)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:21:39] (03CR) 10AikoChou: changeprop: set wiki_id match config for outlink stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:21:51] (03PS4) 10AikoChou: changeprop: set wiki_id match config for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) [14:23:06] (03CR) 10Majavah: [C: 04-1] "Fighting the CI seems rather pointless here.. you need to update the config to the archived preset and force merge after that." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [14:23:48] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:23:51] (ProbeDown) resolved: Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#idm2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:03] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: switchdc SAL log entries are getting cut off because long lines are being split over IRC - https://phabricator.wikimedia.org/T285709 (10Volans) p:05Triage→03Low [14:24:12] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:24:28] (03CR) 10Ilias Sarantopoulos: [C: 03+1] changeprop: set wiki_id match config for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:26:26] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:26:48] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:26:58] (03PS6) 10Raymond Ndibe: Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) [14:27:09] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:27:16] (03CR) 10CI reject: [V: 04-1] Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [14:27:26] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:27:28] (03PS12) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 
10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [14:27:52] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:29:26] 10SRE-tools, 10Observability-Logging, 10Spicerack: Create a cookbook for managing Logstash cluster restarts - https://phabricator.wikimedia.org/T293929 (10joanna_borun) [14:29:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) [14:29:47] (03PS1) 10Hnowlan: rest-gateway: add hostname with port [deployment-charts] - 10https://gerrit.wikimedia.org/r/931281 [14:30:01] (03PS13) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [14:30:15] (03PS3) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [14:33:41] (03CR) 10AikoChou: changeprop: set wiki_id match config for outlink stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:38:17] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:38:24] 10SRE, 10Infrastructure-Foundations, 10netops: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (10ayounsi) Not tested but looks like the syntax changed slightly to: ` set forwarding-options enhanced-hash-key inet ? Possible completions: + apply-grou... 
[14:38:35] (03PS14) 10Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [14:39:32] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:931073|file: Make pre-gen rendering of multi-page files (pdf, ...) serial (T337649)]] (duration: 20m 07s) [14:39:36] T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 [14:42:00] (03PS1) 10Snwachukwu: Test Refine_sanitize migration to spark3. [puppet] - 10https://gerrit.wikimedia.org/r/931284 [14:42:36] (03CR) 10Muehlenhoff: ferm: Allow passing the port is a more structured way (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:42:46] (03PS1) 10Arturo Borrero Gonzalez: dnsrecursor: follow up on query-local-address changes [puppet] - 10https://gerrit.wikimedia.org/r/931285 [14:43:53] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] dnsrecursor: introduce query-local-address parameter (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [14:44:23] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (10joanna_borun) [14:44:30] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (10SLyngshede-WMF) 05In progress→03Resolved [14:44:56] (03CR) 10Muehlenhoff: "Partial PCC for the modified new type https://puppet-compiler.wmflabs.org/output/930656/41797/ (will also run a full PCC run to catch pote" [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:45:41] !log 
hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:45:43] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:46:04] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (10Volans) @JMeybohm am I interpreting correctly that you're saying that those are raising an exception because the reboot or puppe... [14:46:35] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:46:36] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:47:29] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:47:47] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:47:52] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:48:17] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:50:35] (03PS2) 10Snwachukwu: Test Refine_sanitize migration to spark3. [puppet] - 10https://gerrit.wikimedia.org/r/931284 (https://phabricator.wikimedia.org/T335308) [14:50:40] (03CR) 10Arturo Borrero Gonzalez: [V: 04-1] "PCC fails for cloud hosts, NOOP for wiki hosts https://puppet-compiler.wmflabs.org/output/931285/41798/" [puppet] - 10https://gerrit.wikimedia.org/r/931285 (owner: 10Arturo Borrero Gonzalez) [14:51:38] (03PS1) 10Jelto: sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339843) [14:51:45] (03PS2) 10Arturo Borrero Gonzalez: dnsrecursor: follow up on query-local-address changes [puppet] - 10https://gerrit.wikimedia.org/r/931285 [14:51:50] (03CR) 10Snwachukwu: [C: 03+1] Test Refine_sanitize migration to spark3. 
[puppet] - 10https://gerrit.wikimedia.org/r/931284 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [14:53:17] (03PS2) 10Jelto: sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) [14:54:23] (03PS3) 10Arturo Borrero Gonzalez: dnsrecursor: follow up on query-local-address changes [puppet] - 10https://gerrit.wikimedia.org/r/931285 [14:55:16] (03CR) 10CI reject: [V: 04-1] sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [14:55:39] (03CR) 10Muehlenhoff: [C: 03+2] Provided a dedicated KDC logrotate config and fix service reload [puppet] - 10https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [14:56:49] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/931285/41800/" [puppet] - 10https://gerrit.wikimedia.org/r/931285 (owner: 10Arturo Borrero Gonzalez) [14:57:35] (03CR) 10Ssingh: [C: 03+1] "Thanks for the quick fix, looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/931285 (owner: 10Arturo Borrero Gonzalez) [14:58:19] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] dnsrecursor: follow up on query-local-address changes [puppet] - 10https://gerrit.wikimedia.org/r/931285 (owner: 10Arturo Borrero Gonzalez) [15:00:23] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (10Volans) p:05Triage→03High [15:00:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet [15:00:55] (03CR) 10Raymond Ndibe: "all open patches has been migrated to the new repo:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [15:01:30] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10jbond) > SGTM, the ability to only log successful executions would be a win for not impactful cookbooks. @ayounsi i think i see you have updated the networkin... 
[15:02:58] RECOVERY - Check systemd state on analytics1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:39] (03CR) 10David Caro: [V: 03+2 C: 03+2] Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [15:03:56] (03CR) 10CI reject: [V: 04-1] Moved to gitlab [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/931279 (https://phabricator.wikimedia.org/T331335) (owner: 10Raymond Ndibe) [15:06:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet [15:07:21] !log installing libxpm security updates [15:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:28] (03PS1) 10Arturo Borrero Gonzalez: cloud: codfw1dev: fix labsldapconfig to use newer server [puppet] - 10https://gerrit.wikimedia.org/r/931287 (https://phabricator.wikimedia.org/T338778) [15:10:36] (03CR) 10Btullis: [C: 03+2] Test Refine_sanitize migration to spark3. 
[puppet] - 10https://gerrit.wikimedia.org/r/931284 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [15:11:02] (03PS1) 10Muehlenhoff: Add library hint for libxpm [puppet] - 10https://gerrit.wikimedia.org/r/931289 [15:12:03] (03PS2) 10Muehlenhoff: Add library hint for libxpm [puppet] - 10https://gerrit.wikimedia.org/r/931289 [15:12:42] (CertAlmostExpired) firing: (4) Certificate for service miscweb1003:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:13:58] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy NLLB model [deployment-charts] - 10https://gerrit.wikimedia.org/r/931290 (https://phabricator.wikimedia.org/T333861) [15:14:23] 10SRE, 10Infrastructure-Foundations: makevm cookbook should remove VMs if OS install fails - https://phabricator.wikimedia.org/T338986 (10ssingh) [15:14:42] (03CR) 10Joal: Test Refine_sanitize migration to spark3. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931284 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [15:16:04] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10ayounsi) I still think it would be valuable to be able to not log anything without hack. right now the cookbook fails with a success during a show operation. 
[15:16:39] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libxpm [puppet] - 10https://gerrit.wikimedia.org/r/931289 (owner: 10Muehlenhoff) [15:17:42] (CertAlmostExpired) firing: (6) Certificate for service miscweb1003:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:22:23] !log Rolling reboot of codfw cache_text nodes to apply Linux update for CVE-2023-1872 - T335835 [15:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:59] (03CR) 10BCornwall: [C: 03+2] sre.__init__: add min_grace_sleep class param [cookbooks] - 10https://gerrit.wikimedia.org/r/930205 (owner: 10Jbond) [15:25:26] (03PS4) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [15:25:34] (03Merged) 10jenkins-bot: sre.__init__: add min_grace_sleep class param [cookbooks] - 10https://gerrit.wikimedia.org/r/930205 (owner: 10Jbond) [15:26:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41801/console" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [15:27:53] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: refresh ldap hosts [puppet] - 10https://gerrit.wikimedia.org/r/931291 (https://phabricator.wikimedia.org/T338778) [15:27:55] (03PS2) 10Cathal Mooney: Update border-in firewall filter to set DSCP bits to DE [homer/public] - 10https://gerrit.wikimedia.org/r/931262 (https://phabricator.wikimedia.org/T339850) [15:29:47] (03CR) 10Jbond: "minor issue with doc string but otherwise lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:30:04] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1530) [15:31:47] (03CR) 10Jbond: Create a CDN host reboot cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:31:59] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [15:33:32] (03PS1) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 [15:34:46] (03PS1) 10Elukey: Add fake password for the ml-cache's keystore [labs/private] - 10https://gerrit.wikimedia.org/r/931293 [15:35:14] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake password for the ml-cache's keystore [labs/private] - 10https://gerrit.wikimedia.org/r/931293 (owner: 10Elukey) [15:36:08] (03PS2) 10Hnowlan: rest-gateway: add hostname with port [deployment-charts] - 10https://gerrit.wikimedia.org/r/931281 [15:36:59] (03PS2) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 [15:37:25] (03CR) 10CI reject: [V: 04-1] role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (owner: 10Elukey) [15:38:11] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10MatthewVernon) @akosiaris sure; do you have opinions on what a good usename would look like for this use case? [15:40:00] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10akosiaris) >>! In T335491#8947044, @MatthewVernon wrote: > @akosiaris sure; do you have opinions on what a good u... 
[15:40:22] (03PS5) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [15:40:24] (03PS3) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 [15:41:31] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41803/console" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (owner: 10Elukey) [15:43:27] (03PS6) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [15:43:29] (03PS4) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 [15:44:09] (03PS1) 10Jbond: dnsrecursor: use an array for query_local_address [puppet] - 10https://gerrit.wikimedia.org/r/931295 [15:44:19] (03CR) 10Elukey: [C: 03+2] role::ml_cache:storage: move internode settings to 'all' [puppet] - 10https://gerrit.wikimedia.org/r/931278 (https://phabricator.wikimedia.org/T339300) (owner: 10Elukey) [15:44:38] (03CR) 10Jbond: "feel free to -1 if this is not desirable" [puppet] - 10https://gerrit.wikimedia.org/r/931295 (owner: 10Jbond) [15:44:42] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41804/console" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (owner: 10Elukey) [15:45:13] (03CR) 10Jbond: "thanks arturo, i have sent a minor update https://gerrit.wikimedia.org/r/c/operations/puppet/+/931295" [puppet] - 10https://gerrit.wikimedia.org/r/931239 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [15:45:53] (03PS16) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [15:46:03] (03CR) 10Arturo Borrero Gonzalez: 
[C: 03+1] "LGTM. thanks" [puppet] - 10https://gerrit.wikimedia.org/r/931295 (owner: 10Jbond) [15:46:16] (03CR) 10BCornwall: Create a CDN host reboot cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:46:38] RECOVERY - Check systemd state on analytics1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:42] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Applying internode-encryption: all - elukey@cumin1001 [15:48:21] (03CR) 10Elukey: "Very high level idea, lemme know ;)" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [15:48:38] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10jbond) p:05Triage→03Medium >>! In T324655#8947003, @ayounsi wrote: > I still think it would be valuable to be able to not log anything without hack. right... 
[15:50:41] (03PS17) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [15:50:49] (03PS1) 10MVernon: thanos: add machinetranslation user [labs/private] - 10https://gerrit.wikimedia.org/r/931296 (https://phabricator.wikimedia.org/T335491) [15:50:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:50:59] !log elukey@cumin1001 END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching A:ml-cache-codfw: Applying internode-encryption: all - elukey@cumin1001 [15:51:00] (03PS1) 10MVernon: profile::thanos::swift: add machinetranslation user [puppet] - 10https://gerrit.wikimedia.org/r/931297 (https://phabricator.wikimedia.org/T335491) [15:51:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41805/console" [puppet] - 10https://gerrit.wikimedia.org/r/931295 (owner: 10Jbond) [15:52:27] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931297 (https://phabricator.wikimedia.org/T335491) (owner: 10MVernon) [15:53:28] (03CR) 10BCornwall: [C: 03+2] Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:55:46] (03Merged) 10jenkins-bot: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:55:51] (03Abandoned) 10AikoChou: ml-services: update revert-risk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/930613 (owner: 10AikoChou) [15:56:36] (03CR) 10Hnowlan: trafficserver: route proton requests via the API gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) (owner: 
10Hnowlan) [15:57:08] (03PS1) 10Elukey: role::ml_cache::storage: upgrade to Cassandra 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931298 (https://phabricator.wikimedia.org/T339300) [15:58:05] (03PS2) 10Elukey: role::ml_cache::storage: upgrade to Cassandra 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931298 (https://phabricator.wikimedia.org/T339300) [15:58:20] (03PS1) 10AikoChou: ml-services: update revert-risk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/931299 [15:58:53] (03CR) 10Elukey: [C: 03+1] ml-services: update revert-risk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/931299 (owner: 10AikoChou) [15:59:43] (03CR) 10Jbond: [C: 03+1] "lgtm couple of nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:59:49] (03CR) 10Elukey: "This change is ready for review." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/931290 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [16:00:41] (03CR) 10AikoChou: [C: 03+2] "Thanks for the review! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/931299 (owner: 10AikoChou) [16:01:25] (03Merged) 10jenkins-bot: ml-services: update revert-risk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/931299 (owner: 10AikoChou) [16:04:46] (03CR) 10Ssingh: [C: 03+1] "Thanks for this improvement, looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/931295 (owner: 10Jbond) [16:05:08] (03CR) 10Jbond: [V: 03+1 C: 03+2] dnsrecursor: use an array for query_local_address [puppet] - 10https://gerrit.wikimedia.org/r/931295 (owner: 10Jbond) [16:05:40] (03PS1) 10Papaul: Change Rob's key so fixing it to my key [homer/public] - 10https://gerrit.wikimedia.org/r/931301 (https://phabricator.wikimedia.org/T336769) [16:06:53] (03CR) 10Papaul: [C: 03+2] Change Rob's key so fixing it to my key [homer/public] - 10https://gerrit.wikimedia.org/r/931301 (https://phabricator.wikimedia.org/T336769) (owner: 10Papaul) [16:07:39] (03CR) 10Volans: cumin: Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [16:09:29] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:15:57] 10Puppet, 10Infrastructure-Foundations, 10Project-Admins, 10PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (10joanna_borun) @Aklapper looks good to me. Thank you. [16:16:45] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [16:18:24] 10SRE, 10Infrastructure-Foundations: makevm cookbook should remove VMs if OS install fails - https://phabricator.wikimedia.org/T338986 (10Volans) p:05Triage→03Medium a:03Volans I think we can just make the `rollback()` return if the VM has been created and not rollback anything. The user can then run the... 
[16:19:50] (PS1) Ladsgroup: Revert "Temporarily bring back legacy encoding in four wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/931078
[16:20:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:22:34] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[16:23:27] (PS1) Volans: sre.ganeti.makevm: skip rollback in some cases [cookbooks] - https://gerrit.wikimedia.org/r/931302 (https://phabricator.wikimedia.org/T338986)
[16:24:54] (PS2) Ladsgroup: Revert "Temporarily bring back legacy encoding in four wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/931078
[16:25:04] jouncebot: nowandnext
[16:25:04] No deployments scheduled for the next 0 hour(s) and 34 minute(s)
[16:25:04] In 0 hour(s) and 34 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1700)
[16:25:04] In 0 hour(s) and 34 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1700)
[16:25:12] (CR) Ladsgroup: [C: +2] Revert "Temporarily bring back legacy encoding in four wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/931078 (owner: Ladsgroup)
[16:25:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:25:59] (Merged) jenkins-bot: Revert "Temporarily bring back legacy encoding in four wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/931078 (owner: Ladsgroup)
[16:26:32] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:931078|Revert "Temporarily bring back legacy encoding in four wikis"]]
[16:27:02] (PS3) Ilias Sarantopoulos: ml-services: deploy NLLB model [deployment-charts] - https://gerrit.wikimedia.org/r/931290 (https://phabricator.wikimedia.org/T333861)
[16:27:23] (PS4) Ilias Sarantopoulos: ml-services: deploy NLLB model [deployment-charts] - https://gerrit.wikimedia.org/r/931290 (https://phabricator.wikimedia.org/T333861)
[16:27:55] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:931078|Revert "Temporarily bring back legacy encoding in four wikis"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[16:41:00] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/931302 (https://phabricator.wikimedia.org/T338986) (owner: Volans)
[16:41:51] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:931078|Revert "Temporarily bring back legacy encoding in four wikis"]] (duration: 15m 19s)
[16:42:21] (CR) Volans: [C: +2] sre.ganeti.makevm: skip rollback in some cases [cookbooks] - https://gerrit.wikimedia.org/r/931302 (https://phabricator.wikimedia.org/T338986) (owner: Volans)
[16:44:41] (CR) FNegri: cumin: Increase connect_timeout for slow servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: FNegri)
[16:44:43] (CR) Ayounsi: [C: +2] config/common: add new key for jbond [homer/public] - https://gerrit.wikimedia.org/r/931229 (https://phabricator.wikimedia.org/T336769) (owner: Jbond)
[16:44:50] (CR) Ayounsi: [C: +2] Add a new ssh key for elukey [homer/public] - https://gerrit.wikimedia.org/r/931057 (https://phabricator.wikimedia.org/T336769) (owner: Elukey)
[16:45:11] (Merged) jenkins-bot: sre.ganeti.makevm: skip rollback in some cases [cookbooks] - https://gerrit.wikimedia.org/r/931302 (https://phabricator.wikimedia.org/T338986) (owner: Volans)
[16:45:19] (Merged) jenkins-bot: config/common: add new key for jbond [homer/public] - https://gerrit.wikimedia.org/r/931229 (https://phabricator.wikimedia.org/T336769) (owner: Jbond)
[16:45:23] (Merged) jenkins-bot: Add a new ssh key for elukey [homer/public] - https://gerrit.wikimedia.org/r/931057 (https://phabricator.wikimedia.org/T336769) (owner: Elukey)
[16:49:05] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (Volans) The ideal solution would be to make the spicerack class accept happily any undefined parameter, the only problem with that is that a...
[16:49:25] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (ayounsi)
[16:50:34] (CR) Volans: "my 2 cents inline" [puppet] - https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) (owner: Majavah)
[16:52:29] SRE, Infrastructure-Foundations, Patch-For-Review: makevm cookbook should remove VMs if OS install fails - https://phabricator.wikimedia.org/T338986 (Volans) Open→Resolved @Dzahn with the above patch merged the issue should be solved. Feel free to reopen if it doesn't.
[16:58:21] SRE, SRE-Access-Requests: Requesting access to releasers-wikidiff2 for TheresNoTime - https://phabricator.wikimedia.org/T338948 (ssingh)
[17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1700)
[17:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T1700).
[17:00:13] (PS1) Ssingh: admin: add samtar to releasers-wikidiff2 [puppet] - https://gerrit.wikimedia.org/r/931303 (https://phabricator.wikimedia.org/T338948)
[17:01:23] (CR) Ssingh: [C: +2] admin: add samtar to releasers-wikidiff2 [puppet] - https://gerrit.wikimedia.org/r/931303 (https://phabricator.wikimedia.org/T338948) (owner: Ssingh)
[17:02:00] (CR) Ayounsi: Add ferm rule to mark all server traffic as DSCP 0 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/931263 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[17:02:10] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-wikidiff2 for TheresNoTime - https://phabricator.wikimedia.org/T338948 (ssingh) Open→Resolved a:ssingh Request merged, please re-open if there are any issues, thanks!
[17:04:20] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (ssingh)
[17:09:57] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (ssingh)
[17:11:30] (CR) Ssingh: [C: +2] admin: Add zabe to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/928803 (https://phabricator.wikimedia.org/T337703) (owner: Zabe)
[17:13:31] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (ssingh) Open→Resolved Thanks for the patch accompanying the task. Merged; please re-open if there are any issues. Thanks!
[17:14:55] SRE-tools, Infrastructure-Foundations, Observability-Logging, Spicerack: Create a cookbook for managing Logstash cluster restarts - https://phabricator.wikimedia.org/T293929 (lmata)
[17:16:19] (CR) Alexandros Kosiaris: [C: -1] profile::thanos::swift: add machinetranslation user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/931297 (https://phabricator.wikimedia.org/T335491) (owner: MVernon)
[17:16:30] (CR) Alexandros Kosiaris: [C: +1] thanos: add machinetranslation user [labs/private] - https://gerrit.wikimedia.org/r/931296 (https://phabricator.wikimedia.org/T335491) (owner: MVernon)
[17:16:42] SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (ssingh) For posterity: Stalled on the bullseye upgrade.
[17:29:04] SRE, Gerrit, LDAP-Access-Requests, Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (ssingh) Open→Resolved a:ssingh aklapper added to the gerritadmin LDAP group. Please reopen if there are issues. Thanks!
[17:32:22] SRE, Gerrit, LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (ssingh) Open→Resolved a:ssingh Removed @Dzahn as per his request and added @Jelto.
[17:35:05] (PS1) Ladsgroup: Stop setting wgLegacyEncdoing [mediawiki-config] - https://gerrit.wikimedia.org/r/931306 (https://phabricator.wikimedia.org/T128150)
[17:36:50] (CR) CI reject: [V: -1] Stop setting wgLegacyEncdoing [mediawiki-config] - https://gerrit.wikimedia.org/r/931306 (https://phabricator.wikimedia.org/T128150) (owner: Ladsgroup)
[17:37:25] (CR) Ladsgroup: "What do you prefer, should we keep the default value in CommonSettings.php, IS.php, or somewhere else?" [mediawiki-config] - https://gerrit.wikimedia.org/r/931306 (https://phabricator.wikimedia.org/T128150) (owner: Ladsgroup)
[17:38:25] SRE, LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (ssingh) a:ssingh Thanks @KFrancis! This still needs someone from WMF to sponsor this request.
[17:38:36] (PS2) Ladsgroup: Stop setting wgLegacyEncdoing [mediawiki-config] - https://gerrit.wikimedia.org/r/931306 (https://phabricator.wikimedia.org/T128150)
[17:43:38] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 142 probes of 706 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:50:04] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 49 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:50:37] SRE, Traffic: Create a cookbook to reboot CDN hosts - https://phabricator.wikimedia.org/T338813 (BCornwall) In progress→Resolved
[17:54:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 63 probes of 706 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:59:09] (Restored) BCornwall: Add mastadon.wikimedia.org domain [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: BCornwall)
[18:01:08] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:09:20] (PS2) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[18:09:46] (CR) CI reject: [V: -1] puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:14:14] (PS1) Gmodena: mw-page-content-change-enrich: HA in main [deployment-charts] - https://gerrit.wikimedia.org/r/931307 (https://phabricator.wikimedia.org/T338233)
[18:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:21:13] (PS3) BCornwall: Add wikimedia.social domain [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586)
[18:21:49] (PS4) BCornwall: Add wikimedia.social domain [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586)
[18:25:32] (PS3) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[18:25:55] (CR) CI reject: [V: -1] puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:29:47] (PS4) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[18:30:11] (CR) CI reject: [V: -1] puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:32:04] (CR) Ssingh: Add wikimedia.social domain (2 comments) [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: BCornwall)
[18:32:38] SRE, Gerrit, LDAP-Access-Requests, Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (Aklapper) Thanks!
[18:34:53] (Abandoned) Aklapper: Enable FileExporter for Gov Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469042 (https://phabricator.wikimedia.org/T207502) (owner: Varnent)
[18:35:26] (PS5) BCornwall: Add wikimedia.social domain [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586)
[18:35:31] (CR) BCornwall: Add wikimedia.social domain (2 comments) [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: BCornwall)
[18:38:38] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[18:39:07] (PS5) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[18:39:30] (CR) CI reject: [V: -1] puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:41:36] (PS6) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[18:43:53] (CR) CI reject: [V: -1] puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:46:18] (PS7) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[18:48:41] (CR) CI reject: [V: -1] puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:50:33] (PS8) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[18:52:36] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:52:36] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:52:57] (PS9) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[18:52:59] (CR) CI reject: [V: -1] puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:56:13] SRE, LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (Reedy) All good with me!
[18:56:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[18:57:13] (CR) Ssingh: Add wikimedia.social domain (2 comments) [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: BCornwall)
[18:57:55] (PS10) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[18:58:52] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:58:52] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:59:40] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 49 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:01:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[19:01:47] (PS11) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[19:03:32] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:03:34] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:05:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:05:08] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:05:38] (PS12) Jbond: puppetserver: Add private repo configurations [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490)
[19:06:42] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41820/console" [puppet] - https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[19:11:30] (Processor usage over 85%) firing: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[19:17:42] (CertAlmostExpired) firing: (6) Certificate for service miscweb1003:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:26:30] (Processor usage over 85%) resolved: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[19:36:36] (PS1) David Martin: Add '.' to events prefix & remove leading '/' from schema_title [mediawiki-config] - https://gerrit.wikimedia.org/r/931316 (https://phabricator.wikimedia.org/T336722)
[19:55:04] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 34 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:56:48] (PS6) BCornwall: Add wikimedia.social domain [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586)
[19:56:52] (CR) BCornwall: Add wikimedia.social domain (2 comments) [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: BCornwall)
[20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T2000)
[20:00:05] No Gerrit patches in the queue for this window AFAICS.
[20:11:49] (CR) Ssingh: Add wikimedia.social domain (2 comments) [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: BCornwall)
[20:13:04] (PS7) BCornwall: Add wikimedia.social domain [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586)
[20:13:34] (CR) BCornwall: Add wikimedia.social domain (1 comment) [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: BCornwall)
[20:16:09] (CR) Ssingh: [C: +1] "Thanks for working on the patch to add this zone!" [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: BCornwall)
[20:16:26] (CR) BCornwall: [C: +2] Add wikimedia.social domain [dns] - https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: BCornwall)
[20:26:50] SRE, Wikimedia-Site-requests, serviceops, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (neriah) @Aklapper Is there a way to speed up the treatment? This task went unanswered for a long time.
[20:50:13] (PS20) BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531)
[21:00:05] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230619T2100)
[21:01:16] SRE, LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (ssingh) a:cmooney→ssingh
[21:09:44] SRE, Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (TheDJ) >>! In T339102#8931346, @valerio.bozzolan wrote: > I would like to help @TheDJ but I didn't understand the "new user" in what. Was vikipedia already using maps.wikimedia.org or do they want to start...
[21:15:41] SRE, Wikimedia-Site-requests, serviceops, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Aklapper) @neriah: [Not that I knew.](https://www.mediawiki.org/wiki/Bug_management/Development_prioritization) Maybe this could...
[21:22:09] (Abandoned) BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: BCornwall)
[22:03:20] SRE, Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (valerio.bozzolan) I can share that Vikidia was an user of Wikimedia Maps but ~3 years ago the maps stopped working for them.
[22:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:38:38] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[22:58:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:58:54] (PS3) Ryan Kemper: service::catalog: Deduplicate search service IPs [puppet] - https://gerrit.wikimedia.org/r/930175 (owner: Alexandros Kosiaris)
[22:59:13] (CR) Ryan Kemper: [C: +1] "Looks great! Thanks for doing this" [puppet] - https://gerrit.wikimedia.org/r/930175 (owner: Alexandros Kosiaris)
[23:03:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:17:42] (CertAlmostExpired) firing: (6) Certificate for service miscweb1003:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:31:59] jouncebot: nowandnext
[23:31:59] No deployments scheduled for the next 2 hour(s) and 28 minute(s)
[23:31:59] In 2 hour(s) and 28 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T0200)