[00:38:39] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/934459
[00:38:45] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/934459 (owner: TrainBranchBot)
[00:59:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/934459 (owner: TrainBranchBot)
[02:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:10] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:00:48] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:21:16] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:32:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:33:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:01:28] SRE, CFSSL-PKI, Infrastructure-Foundations, Patch-For-Review, Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (jhathaway) @jbond I spent quite a bit of time on this Friday, but also came up empty handed. I suspected some str...
[04:24:28] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:45:32] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:52:25] (PS1) TChin: Bump stream versions in mw-page-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/934719 (https://phabricator.wikimedia.org/T340746)
[04:59:52] (PS1) KartikMistry: Update cxserver to 2023-07-03-045311-production [deployment-charts] - https://gerrit.wikimedia.org/r/934720 (https://phabricator.wikimedia.org/T285217)
[05:17:10] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:36:33] (PS1) Sohom Datta: Enable edit-in-sequence in Italian Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/934723 (https://phabricator.wikimedia.org/T340847)
[05:38:04] (CR)
Samwilson: [C: +1] Enable edit-in-sequence in Italian Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/934723 (https://phabricator.wikimedia.org/T340847) (owner: Sohom Datta)
[05:38:16] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:07:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:09:52] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:12:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:20:24] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:27:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:30:40] SRE, serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (Joe) In progress→Resolved
[06:30:45] SRE,
ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (Joe)
[06:58:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:00:05] Amir1, Urbanecm, and taavi: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T0700).
[07:00:05] thedj, Sohom_Datta, and Kizule: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:12] o/
[07:00:22] \o
[07:02:20] o/ I can deploy
[07:02:41] Great!
[07:02:45] dj's backport is already deployed it seems
[07:03:01] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/934723 (https://phabricator.wikimedia.org/T340847) (owner: Sohom Datta)
[07:03:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:03:45] (Merged) jenkins-bot: Enable edit-in-sequence in Italian Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/934723 (https://phabricator.wikimedia.org/T340847) (owner: Sohom Datta)
[07:03:49] (CR) Nikerabbit: [C: +1] TranslationNotifications: Run UnsubscribeInactiveUsers periodically [puppet] - https://gerrit.wikimedia.org/r/928159
(https://phabricator.wikimedia.org/T323192) (owner: Abijeet Patro)
[07:04:07] !log taavi@deploy1002 Started scap: Backport for [[gerrit:934723|Enable edit-in-sequence in Italian Wikisource (T340847)]]
[07:04:10] T340847: Enable edit-in-sequence on it.wikisource - https://phabricator.wikimedia.org/T340847
[07:13:35] !log taavi@deploy1002 soda and taavi: Backport for [[gerrit:934723|Enable edit-in-sequence in Italian Wikisource (T340847)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:13:38] T340847: Enable edit-in-sequence on it.wikisource - https://phabricator.wikimedia.org/T340847
[07:13:46] Sohom_Datta: please test your patch
[07:14:14] Yep on it :)
[07:16:10] Looks good :) thanks for deploying :)
[07:16:16] great, syncing
[07:22:29] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:934723|Enable edit-in-sequence in Italian Wikisource (T340847)]] (duration: 18m 21s)
[07:22:32] T340847: Enable edit-in-sequence on it.wikisource - https://phabricator.wikimedia.org/T340847
[07:22:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:23:19] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/932642 (https://phabricator.wikimedia.org/T340397) (owner: Msz2001)
[07:23:40] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:24:05] (Merged)
jenkins-bot: Update plwiki autopromote per consensus [mediawiki-config] - https://gerrit.wikimedia.org/r/932642 (https://phabricator.wikimedia.org/T340397) (owner: Msz2001)
[07:24:23] !log taavi@deploy1002 Started scap: Backport for [[gerrit:932642|Update plwiki autopromote per consensus (T340397)]]
[07:24:26] T340397: Request for a change of parameters for pl:wiki flagged revisions - https://phabricator.wikimedia.org/T340397
[07:25:46] !log taavi@deploy1002 msz2001 and taavi: Backport for [[gerrit:932642|Update plwiki autopromote per consensus (T340397)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[07:25:59] Kizule: is it possible to test your patch?
[07:26:17] taavi: Not really.
[07:26:28] yeah I guessed that. I'll just sync directly
[07:27:02] Sounds good to me.
[07:27:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:32:12] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:932642|Update plwiki autopromote per consensus (T340397)]] (duration: 07m 48s)
[07:32:15] T340397: Request for a change of parameters for pl:wiki flagged revisions - https://phabricator.wikimedia.org/T340397
[07:32:25] done!
[07:33:06] Thank you, wiki looks fine!
[07:34:10] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:37:12] (PS1) Majavah: P:toolforge: use toolforge.org as primary mail domain [puppet] - https://gerrit.wikimedia.org/r/934864
[07:38:15] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42164/console" [puppet] - https://gerrit.wikimedia.org/r/934864 (owner: Majavah)
[07:48:13] (PS1) Elukey: services: add dbproxy1027 to netpolicies in toolhub [deployment-charts] - https://gerrit.wikimedia.org/r/934866
[07:51:04] (CR) Elukey: [C: +2] services: add dbproxy1027 to netpolicies in toolhub [deployment-charts] - https://gerrit.wikimedia.org/r/934866 (owner: Elukey)
[07:53:23] (CR) Filippo Giunchedi: [C: +1] Logstash: implement availability SLO [grafana-grizzly] - https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) (owner: Cwhite)
[07:54:04] (PS3) Ladsgroup: Set externallinks migration to read new everywhere except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/931240 (https://phabricator.wikimedia.org/T335343)
[07:54:33] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: sync
[07:54:46] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: sync
[07:58:04] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:58:10] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:05:18] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough,
WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:06:11] (PS1) Urbanecm: linkrecommendation: Disable cronjob in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/934867 (https://phabricator.wikimedia.org/T334928)
[08:15:32] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:17:47] (PS1) Btullis: Deploy updated datahub images [deployment-charts] - https://gerrit.wikimedia.org/r/934981 (https://phabricator.wikimedia.org/T329514)
[08:21:34] (PS1) Muehlenhoff: Remove access for tmtl.io contractors [puppet] - https://gerrit.wikimedia.org/r/934988
[08:22:18] (CR) CI reject: [V: -1] Remove access for tmtl.io contractors [puppet] - https://gerrit.wikimedia.org/r/934988 (owner: Muehlenhoff)
[08:25:16] (PS2) Muehlenhoff: Remove access for tmtl.io contractors [puppet] - https://gerrit.wikimedia.org/r/934988
[08:29:22] (CR) Muehlenhoff: [C: +2] Remove access for tmtl.io contractors [puppet] - https://gerrit.wikimedia.org/r/934988 (owner: Muehlenhoff)
[08:31:56] (CR) Gmodena: [C: +1] Bump stream versions in mw-page-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/934719 (https://phabricator.wikimedia.org/T340746) (owner: TChin)
[08:33:33] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[08:39:25] (PS1) Muehlenhoff: Remove access for aranyap [puppet] - https://gerrit.wikimedia.org/r/934989
[08:40:03] jouncebot: nowandnext
[08:40:03] No deployments scheduled for the next 1 hour(s) and 19 minute(s)
[08:40:04] In 1 hour(s) and 19 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T1000)
[08:40:10] (CR) Btullis: [C: +2] Deploy updated datahub images [deployment-charts] - https://gerrit.wikimedia.org/r/934981
(https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[08:40:55] (CR) Muehlenhoff: [C: +2] Remove access for aranyap [puppet] - https://gerrit.wikimedia.org/r/934989 (owner: Muehlenhoff)
[08:41:12] (Merged) jenkins-bot: Deploy updated datahub images [deployment-charts] - https://gerrit.wikimedia.org/r/934981 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[08:44:31] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[08:44:36] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Aranyap out of all services on: 760 hosts
[08:44:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Aranyap out of all services on: 760 hosts
[08:45:09] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Aranyap out of all services on: 1271 hosts
[08:45:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Aranyap out of all services on: 1271 hosts
[08:45:53] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Aranyap out of all services on: 20 hosts
[08:45:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Aranyap out of all services on: 20 hosts
[08:46:32] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dasm out of all services on: 20 hosts
[08:46:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dasm out of all services on: 20 hosts
[08:46:44] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dasm out of all services on: 1271 hosts
[08:46:58] PROBLEM - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:47:25] !log jmm@cumin2002 END (PASS) - Cookbook
sre.idm.logout (exit_code=0) Logging Dasm out of all services on: 1271 hosts
[08:47:31] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dasm out of all services on: 760 hosts
[08:47:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dasm out of all services on: 760 hosts
[08:48:01] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[08:48:11] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging David.pujol out of all services on: 760 hosts
[08:48:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging David.pujol out of all services on: 760 hosts
[08:48:42] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging David.pujol out of all services on: 1271 hosts
[08:49:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging David.pujol out of all services on: 1271 hosts
[08:49:31] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dasm out of all services on: 20 hosts
[08:49:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dasm out of all services on: 20 hosts
[08:49:55] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Damiendf out of all services on: 20 hosts
[08:50:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Damiendf out of all services on: 20 hosts
[08:50:26] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Damiendf out of all services on: 1271 hosts
[08:50:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Damiendf out of all services on: 1271 hosts
[08:51:08] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Damiendf out of all services on: 760 hosts
[08:51:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Damiendf out of all services on: 760 hosts
[08:51:44] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Tom Magerlein out of all services on:
760 hosts
[08:51:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Tom Magerlein out of all services on: 760 hosts
[08:52:04] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Tom Magerlein out of all services on: 1271 hosts
[08:52:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Tom Magerlein out of all services on: 1271 hosts
[08:52:54] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Tom Magerlein out of all services on: 20 hosts
[08:52:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Tom Magerlein out of all services on: 20 hosts
[08:53:34] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Skye Berghel out of all services on: 20 hosts
[08:53:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Skye Berghel out of all services on: 20 hosts
[08:53:50] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Skye Berghel out of all services on: 1271 hosts
[08:54:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Skye Berghel out of all services on: 1271 hosts
[08:54:41] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Skye Berghel out of all services on: 760 hosts
[08:54:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Skye Berghel out of all services on: 760 hosts
[08:55:05] (PS5) Fabfur: varnish: Remove http/https redirection [puppet] - https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557)
[08:55:16] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Michael.hay out of all services on: 760 hosts
[08:55:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Michael.hay out of all services on: 760 hosts
[08:55:33] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write
to the SAL - https://phabricator.wikimedia.org/T325756 (Volans) @fnegri thanks for the work on this! I think that as an interim workaround this is...
[08:55:52] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Michael.hay out of all services on: 1271 hosts
[08:56:18] (CR) Vgutierrez: [C: -1] "text tests are still failing, it looks like text/13-tls-redirect.vtc needs to be gone" [puppet] - https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: Fabfur)
[08:56:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Michael.hay out of all services on: 1271 hosts
[08:56:50] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Michael.hay out of all services on: 20 hosts
[08:56:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Michael.hay out of all services on: 20 hosts
[08:58:19] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[09:01:09] jouncebot: nowandnext
[09:01:09] No deployments scheduled for the next 0 hour(s) and 58 minute(s)
[09:01:09] In 0 hour(s) and 58 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T1000)
[09:01:27] I have a security change to deploy, I’ll do that if nobody objects within a few minutes
[09:04:07] (CR) Volans: [C: +1] "LGTM, small nit inline. I'm not too familiar with squid syntax but looks sane to me."
[puppet] - https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: FNegri)
[09:04:22] (PS1) Muehlenhoff: Remove LDAP access for babiola [puppet] - https://gerrit.wikimedia.org/r/934991
[09:04:55] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Barakat Ajadi out of all services on: 4 hosts
[09:05:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Barakat Ajadi out of all services on: 4 hosts
[09:06:52] alright, going ahead
[09:08:00] RECOVERY - MegaRAID on an-worker1095 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:08:07] (CR) Muehlenhoff: [C: +2] Remove LDAP access for babiola [puppet] - https://gerrit.wikimedia.org/r/934991 (owner: Muehlenhoff)
[09:08:37] (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge: use toolforge.org as primary mail domain [puppet] - https://gerrit.wikimedia.org/r/934864 (owner: Majavah)
[09:09:20] moritzm: it seems we clicked submit at the exact same time
[09:10:38] merging!
[09:13:08] (CR) Volans: [C: +2] "Just documentation, self-merging" [puppet] - https://gerrit.wikimedia.org/r/933891 (owner: Volans)
[09:13:12] !log lucaswerkmeister-wmde: Deployed security patch for T339016
[09:14:40] * Lucas_WMDE done
[09:15:17] (PS1) Muehlenhoff: Remove access for bscarone [puppet] - https://gerrit.wikimedia.org/r/934993
[09:17:13] SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (BTullis)
[09:17:18] (CR) Muehlenhoff: [C: +2] Remove access for bscarone [puppet] - https://gerrit.wikimedia.org/r/934993 (owner: Muehlenhoff)
[09:17:59] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bruno Scarone out of all services on: 20 hosts
[09:18:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bruno Scarone out of all services on: 20 hosts
[09:18:10] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bruno Scarone out of all services on: 1271 hosts
[09:18:36] arturo: thx!
[09:18:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bruno Scarone out of all services on: 1271 hosts
[09:19:03] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bruno Scarone out of all services on: 760 hosts
[09:19:08] ops-eqiad, Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (BTullis) p:Triage→Medium a:Jclark-ctr Hi @Jclark-ctr - We've had another RAID controller fail from the same batch of servers again. Would you be able to replace it pleas...
[09:19:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bruno Scarone out of all services on: 760 hosts
[09:19:37] (PS6) Fabfur: varnish: Remove http/https redirection [puppet] - https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557)
[09:19:51] ACKNOWLEDGEMENT - MegaRAID on an-worker1095 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T340946 - Requested replacement https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:20:01] (CR) Ladsgroup: [C: +2] Set externallinks migration to read new everywhere except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/931240 (https://phabricator.wikimedia.org/T335343) (owner: Ladsgroup)
[09:20:55] (Merged) jenkins-bot: Set externallinks migration to read new everywhere except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/931240 (https://phabricator.wikimedia.org/T335343) (owner: Ladsgroup)
[09:22:11] (PS1) JMeybohm: mathoid: Switch to envoy 1.23.10 [deployment-charts] - https://gerrit.wikimedia.org/r/934996 (https://phabricator.wikimedia.org/T300324)
[09:23:58] (PS1) JMeybohm: mw-debug: Switch to envoy 1.23.10 [deployment-charts] - https://gerrit.wikimedia.org/r/934998 (https://phabricator.wikimedia.org/T300324)
[09:24:00] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:931240|Set externallinks migration to read new everywhere except commons (T335343)]]
[09:24:03] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343
[09:24:49] (CR) JMeybohm: [C: +2] mathoid: Switch to envoy 1.23.10 [deployment-charts] - https://gerrit.wikimedia.org/r/934996 (https://phabricator.wikimedia.org/T300324)
(owner: JMeybohm)
[09:25:23] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:931240|Set externallinks migration to read new everywhere except commons (T335343)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[09:25:49] (Merged) jenkins-bot: mathoid: Switch to envoy 1.23.10 [deployment-charts] - https://gerrit.wikimedia.org/r/934996 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[09:27:29] (CR) Clément Goubert: [C: +1] mw-debug: Switch to envoy 1.23.10 [deployment-charts] - https://gerrit.wikimedia.org/r/934998 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[09:27:58] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[09:28:22] (CR) Sergio Gimeno: [C: +1] linkrecommendation: Disable cronjob in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/934867 (https://phabricator.wikimedia.org/T334928) (owner: Urbanecm)
[09:31:33] (PS1) Muehlenhoff: Remove access for appledora [puppet] - https://gerrit.wikimedia.org/r/934999
[09:31:45] (CR) CI reject: [V: -1] Remove access for appledora [puppet] - https://gerrit.wikimedia.org/r/934999 (owner: Muehlenhoff)
[09:32:56] SRE, User-aborrero: gerrit.w.o is not included in https://config-master.wikimedia.org/known_hosts - https://phabricator.wikimedia.org/T340947 (aborrero) p:Triage→Low
[09:33:09] (PS2) Muehlenhoff: Remove access for appledora [puppet] - https://gerrit.wikimedia.org/r/934999
[09:34:11] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP host
[09:34:24] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP host
[09:34:46] !log ladsgroup@deploy1002 Finished scap:
Backport for [[gerrit:931240|Set externallinks migration to read new everywhere except commons (T335343)]] (duration: 10m 46s)
[09:34:49] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343
[09:34:58] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[09:35:30] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[09:35:38] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[09:36:06] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[09:36:43] (CR) Muehlenhoff: [C: +2] Remove access for appledora [puppet] - https://gerrit.wikimedia.org/r/934999 (owner: Muehlenhoff)
[09:37:05] (CR) JMeybohm: [C: +2] mw-debug: Switch to envoy 1.23.10 [deployment-charts] - https://gerrit.wikimedia.org/r/934998 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[09:37:42] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Appledora out of all services on: 760 hosts
[09:37:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Appledora out of all services on: 760 hosts
[09:38:07] (Merged) jenkins-bot: mw-debug: Switch to envoy 1.23.10 [deployment-charts] - https://gerrit.wikimedia.org/r/934998 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[09:44:23] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Appledora out of all services on: 1271 hosts
[09:45:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Appledora out of all services on: 1271 hosts
[09:47:39] (PS1) JMeybohm: Remove kubernetesApi hack from rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/935002 (https://phabricator.wikimedia.org/T326729)
[09:49:37] (PS1) JMeybohm: deployment_server/kubernetes: Remove kubernetesApi hack [puppet] - https://gerrit.wikimedia.org/r/935003
(https://phabricator.wikimedia.org/T326729)
[09:50:05] SRE, SRE-swift-storage, Commons: Persistent 404 errors for some Commons files, fixable by overwriting - https://phabricator.wikimedia.org/T334346 (Aklapper) Open→Resolved Resolving per last two comments
[09:50:19] (CR) JMeybohm: [C: +2] Remove kubernetesApi hack from rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/935002 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[09:50:46] (CR) JMeybohm: [C: +2] deployment_server/kubernetes: Remove kubernetesApi hack [puppet] - https://gerrit.wikimedia.org/r/935003 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[09:50:48] SRE, User-aborrero: gerrit.w.o is not included in https://config-master.wikimedia.org/known_hosts - https://phabricator.wikimedia.org/T340947 (Volans) `gerrit.wikimedia.org` is correctly exported via `@@sshkey` in puppet and is present in production hosts's `ssh_known_hosts` files. It's missing from conf...
[09:51:28] (Merged) jenkins-bot: Remove kubernetesApi hack from rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/935002 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[09:53:15] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[09:54:09] SRE, User-aborrero: gerrit.w.o is not included in https://config-master.wikimedia.org/known_hosts - https://phabricator.wikimedia.org/T340947 (Volans) If we include it in the exported ones I guess we'll need to adjust this one too: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/wmf-sre-la...
[09:55:40] (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Disable cronjob in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/934867 (https://phabricator.wikimedia.org/T334928) (owner: 10Urbanecm) [09:56:37] (03Merged) 10jenkins-bot: linkrecommendation: Disable cronjob in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/934867 (https://phabricator.wikimedia.org/T334928) (owner: 10Urbanecm) [09:57:16] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [09:57:18] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [09:58:03] !log urbanecm@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [09:58:05] !log urbanecm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [09:58:15] !log urbanecm@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [09:58:48] !log urbanecm@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [09:59:00] (03PS1) 10Arturo Borrero Gonzalez: team-wmcs: openstack_apis_response: cleanup definition [alerts] - 10https://gerrit.wikimedia.org/r/935006 [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T1000) [10:02:32] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [10:03:06] Amir1: you look last to deploy ^ [10:03:40] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:05:43] (03PS2) 10Arturo Borrero Gonzalez: team-wmcs: openstack_apis_response: cleanup definition [alerts] - 10https://gerrit.wikimedia.org/r/935006 [10:10:16] (03PS2) 10Hnowlan: poolcounter: emit metrics to display the type of throttling [software/thumbor-plugins] - 
10https://gerrit.wikimedia.org/r/934508 [10:10:22] (03CR) 10Hnowlan: poolcounter: emit metrics to display the type of throttling (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/934508 (owner: 10Hnowlan) [10:11:42] (03PS9) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) [10:11:45] (03PS1) 10JMeybohm: mw-debug: Remove envoy version override in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/935008 (https://phabricator.wikimedia.org/T300324) [10:13:05] (03CR) 10JMeybohm: [C: 03+2] mw-debug: Remove envoy version override in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/935008 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [10:13:39] (03CR) 10FNegri: Allow cloudcumin hosts to connect to wm-bot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [10:14:25] (03Merged) 10jenkins-bot: mw-debug: Remove envoy version override in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/935008 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [10:15:06] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Lucy_Patterson_WMDE) ok! After a lot of back and forth we've decided openglam-de@lists.wikimedia.org would be the best solution. Could you set it up for us? We'd need to change the description a littl... [10:15:53] (03PS10) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) [10:16:41] (03CR) 10FNegri: "I also did another minor change, adding _port to make the three acls more similar and easy to identify (_port, _src, _dst)." 
[puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [10:16:43] (03Abandoned) 10AikoChou: ml-services: increase memory resources for readability isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/934582 (https://phabricator.wikimedia.org/T334182) (owner: 10AikoChou) [10:17:43] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:18:11] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:18:47] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:19:21] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:20:14] (03CR) 10Hnowlan: [C: 03+2] poolcounter: emit metrics to display the type of throttling [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/934508 (owner: 10Hnowlan) [10:20:41] (03CR) 10FNegri: "LGTM. I'm not 100% sure if "for:" defaults to "0" when omitted, but I assume so?" [alerts] - 10https://gerrit.wikimedia.org/r/935006 (owner: 10Arturo Borrero Gonzalez) [10:21:35] (03PS1) 10Muehlenhoff: Remove LDAP access for simone-this-dot [puppet] - 10https://gerrit.wikimedia.org/r/935010 [10:23:18] 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10jijiki) 05Open→03Resolved Done. [10:23:36] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is OK: Files ownership is ok. 
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[10:23:39] (CR) Vgutierrez: [C: -1] varnish: Remove http/https redirection (1 comment) [puppet] - https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: Fabfur)
[10:23:41] (CR) Muehlenhoff: [C: +2] Remove LDAP access for simone-this-dot [puppet] - https://gerrit.wikimedia.org/r/935010 (owner: Muehlenhoff)
[10:24:40] (Merged) jenkins-bot: poolcounter: emit metrics to display the type of throttling [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/934508 (owner: Hnowlan)
[10:27:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:29:07] (PS1) Muehlenhoff: Remove LDAP access for lwatson [puppet] - https://gerrit.wikimedia.org/r/935012
[10:29:41] (CR) Vgutierrez: [C: -1] "looking good :) tests are now (PS6) happy for both text and upload, please address the outstanding comments."
[puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [10:31:33] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Ladsgroup) 05Open→03Resolved Done \o/: https://lists.wikimedia.org/postorius/lists/openglam-de.lists.wikimedia.org [10:33:24] (03PS2) 10Effie Mouzeli: (WIP) hieradata: Add profile::lvs::realserver to kubestagemaster (#2) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) [10:34:26] (03PS3) 10Effie Mouzeli: (WIP) Add profile::lvs::realserver to kubestagemaster (#2) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) [10:34:45] (03PS2) 10Effie Mouzeli: (WIP) hieradata: add control plane nodes for kubestagemaster (#3) [puppet] - 10https://gerrit.wikimedia.org/r/934558 (https://phabricator.wikimedia.org/T329827) [10:34:57] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [10:39:48] (03CR) 10Arturo Borrero Gonzalez: team-wmcs: openstack_apis_response: cleanup definition (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/935006 (owner: 10Arturo Borrero Gonzalez) [10:42:28] !log imported envoyproxy 1.23.10 to buster-wikimedia, bullseye-wikimedia, bookworm-wikimedia - T300324 [10:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:32] T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 [10:42:36] (03PS7) 10Fabfur: varnish: Remove http/https redirection [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) [10:42:58] (03PS1) 10Clément Goubert: mediawiki: Set PHP slowlog_timeout to 5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935013 [10:43:07] (03PS2) 10Clément Goubert: mediawiki: Set PHP slowlog_timeout to 5 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/935013 [10:43:33] (03CR) 10Fabfur: [C: 04-1] "Waiting for another pair of eyes" [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [10:45:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Set PHP slowlog_timeout to 5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935013 (owner: 10Clément Goubert) [10:45:37] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Set PHP slowlog_timeout to 5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935013 (owner: 10Clément Goubert) [10:46:20] (03Merged) 10jenkins-bot: mediawiki: Set PHP slowlog_timeout to 5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935013 (owner: 10Clément Goubert) [10:46:50] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:46:53] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:47:07] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:47:33] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:47:44] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:48:04] !log Re-activating Vodafone DE peering at AMS-IX T340670 [10:48:07] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:07] T340670: Connection errors from users on Vodafone DE (AS3209) [28.06.2023] - https://phabricator.wikimedia.org/T340670 [10:48:14] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [10:48:21] (03CR) 10Fabfur: varnish: Remove http/https redirection [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [10:48:54] !log cgoubert@deploy1002 
helmfile [codfw] START helmfile.d/services/mw-web: apply [10:49:39] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:49:50] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:50:24] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:50:33] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:51:08] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:51:12] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:51:47] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:51:56] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:52:19] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:52:54] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [10:52:57] !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [10:52:57] !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [10:53:11] !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [10:53:13] !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [10:53:19] !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [10:53:19] !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [10:53:33] !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [10:53:36] !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [10:55:27] 10SRE, 10Traffic, 10envoy, 10serviceops: Set a limit to the 
number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm) [10:58:17] (03PS1) 10Hnowlan: thumbor: add poolcounter throttled class metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/935016 [11:01:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] team-wmcs: openstack_apis_response: cleanup definition [alerts] - 10https://gerrit.wikimedia.org/r/935006 (owner: 10Arturo Borrero Gonzalez) [11:03:40] (03Merged) 10jenkins-bot: team-wmcs: openstack_apis_response: cleanup definition [alerts] - 10https://gerrit.wikimedia.org/r/935006 (owner: 10Arturo Borrero Gonzalez) [11:04:58] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42173/console" [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [11:05:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/933892 (owner: 10Volans) [11:05:32] (03CR) 10Fabfur: [V: 03+1 C: 03+2] varnish: Remove http/https redirection [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [11:06:31] (03PS2) 10Hnowlan: thumbor: add poolcounter throttled class metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/935016 [11:07:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/934595 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [11:09:01] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Appledora out of all services on: 20 hosts [11:09:03] 10SRE, 10Infrastructure-Foundations, 10netops: Connection errors from users on Vodafone DE (AS3209) [28.06.2023] - https://phabricator.wikimedia.org/T340670 (10cmooney) 05Open→03Resolved a:03cmooney Session is re-established ~20 mins now and there has been no increase in NELs for this ASN. Marking as... 
[11:09:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Appledora out of all services on: 20 hosts [11:09:12] (03CR) 10Kamila Součková: [C: 03+1] thumbor: add poolcounter throttled class metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/935016 (owner: 10Hnowlan) [11:09:22] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AKhatun out of all services on: 20 hosts [11:09:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AKhatun out of all services on: 20 hosts [11:09:33] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AKhatun out of all services on: 1271 hosts [11:09:55] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [11:10:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AKhatun out of all services on: 1271 hosts [11:10:57] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AKhatun out of all services on: 760 hosts [11:11:00] (03PS2) 10Effie Mouzeli: (WIP) service: Add kubestagemaster service (#1) [puppet] - 10https://gerrit.wikimedia.org/r/934552 (https://phabricator.wikimedia.org/T329827) [11:11:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AKhatun out of all services on: 760 hosts [11:11:18] (03CR) 10Jbond: "see inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/934602 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [11:12:39] (03PS1) 10Muehlenhoff: Remove access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/935020 [11:15:19] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add VIP for kubestagemaster - jiji@cumin1001" [11:15:45] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/934668 (owner: 10Majavah) [11:16:05] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/935020 (owner: 
10Muehlenhoff) [11:16:17] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for lwatson [puppet] - 10https://gerrit.wikimedia.org/r/935012 (owner: 10Muehlenhoff) [11:16:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add VIP for kubestagemaster - jiji@cumin1001" [11:16:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:17:26] (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/934669 (owner: 10Majavah) [11:17:29] (03CR) 10Jbond: [C: 03+2] keyholder: systemd-ify [puppet] - 10https://gerrit.wikimedia.org/r/934668 (owner: 10Majavah) [11:17:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [11:18:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/933892 (owner: 10Volans) [11:19:15] (03Abandoned) 10Muehlenhoff: Add a systemd timer to cleanup cookbooks_testing [puppet] - 10https://gerrit.wikimedia.org/r/933882 (owner: 10Muehlenhoff) [11:21:35] (03PS1) 10Muehlenhoff: Remove access for jameel [puppet] - 10https://gerrit.wikimedia.org/r/935024 [11:22:37] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [11:22:40] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for jameel [puppet] - 10https://gerrit.wikimedia.org/r/935024 (owner: 10Muehlenhoff) [11:23:37] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/934464 [11:24:27] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Jameel Kaisar out of all services on: 760 hosts [11:24:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jameel Kaisar out of all services on: 760 hosts [11:24:43] 10SRE, 10Infrastructure-Foundations, 10User-aborrero: gerrit.w.o is not included in https://config-master.wikimedia.org/known_hosts - https://phabricator.wikimedia.org/T340947 (10LSobanski) [11:24:51] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Jameel Kaisar out of all services on: 1271 hosts [11:25:02] (03CR) 10Volans: [C: 03+2] test-cookbook: do not run as root [puppet] - 10https://gerrit.wikimedia.org/r/933892 (owner: 10Volans) [11:25:10] (03CR) 10Hnowlan: [C: 03+2] thumbor: add poolcounter throttled class metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/935016 (owner: 10Hnowlan) [11:25:22] 10SRE, 10Infrastructure-Foundations: compare Probenet data w/ NEL data - https://phabricator.wikimedia.org/T337317 (10LSobanski) [11:25:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jameel Kaisar out of all services on: 1271 hosts [11:26:01] (03PS3) 10Effie Mouzeli: service::catalog: Add kubestagemaster service (#1) [puppet] - 10https://gerrit.wikimedia.org/r/934552 (https://phabricator.wikimedia.org/T329827) [11:26:13] (03Merged) 10jenkins-bot: thumbor: add poolcounter throttled class metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/935016 (owner: 10Hnowlan) [11:26:21] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Jameel Kaisar out of all services on: 20 hosts [11:26:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jameel Kaisar out of all services on: 20 hosts [11:26:28] (03PS4) 10Effie Mouzeli: Add profile::lvs::realserver to kubestagemaster (#2) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) [11:26:34] 
(03CR) 10David Caro: "I think that we should not rely on the behavior of prometheus when there's only one value in the series... doing some tests I think we sho" [alerts] - 10https://gerrit.wikimedia.org/r/935006 (owner: 10Arturo Borrero Gonzalez) [11:27:29] (03PS1) 10JMeybohm: Don't restart systemd service on upgrade [debs/envoyproxy] (v1.23) - 10https://gerrit.wikimedia.org/r/935030 (https://phabricator.wikimedia.org/T340955) [11:27:32] (03PS3) 10Effie Mouzeli: hieradata: add control plane nodes for kubestagemaster (#3) [puppet] - 10https://gerrit.wikimedia.org/r/934558 (https://phabricator.wikimedia.org/T329827) [11:30:14] (03PS4) 10Effie Mouzeli: hieradata: add control plane nodes for kubestagemaster (#3) [puppet] - 10https://gerrit.wikimedia.org/r/934558 (https://phabricator.wikimedia.org/T329827) [11:30:14] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [11:30:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:32:56] PROBLEM - Check systemd state on puppetboard1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetboard.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:57] 10SRE, 10Domains: Mark Monitor administration panel (redirects for wikimedia.pl) - https://phabricator.wikimedia.org/T333827 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving as there's nothing for SRE to do right now. @Jacek_Broda_WMPL, please reopen if needed after the conversation with Legal. 
[11:33:01] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:33:14] (03PS1) 10Muehlenhoff: Extend access for nickifeajika [puppet] - 10https://gerrit.wikimedia.org/r/935032 [11:34:13] (03CR) 10Jbond: [C: 03+1] "lgtm, couple of minor nits" [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [11:34:30] (03CR) 10Jbond: sre.hardware: Add support for adding csrf-token (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [11:35:07] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:35:14] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for nickifeajika [puppet] - 10https://gerrit.wikimedia.org/r/935032 (owner: 10Muehlenhoff) [11:35:19] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:36:26] (03PS5) 10Effie Mouzeli: Add profile::lvs::realserver to kubestagemaster (#2) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) [11:37:28] (03PS1) 10Func: SpecialLog: Fix issues related to IP users [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934624 (https://phabricator.wikimedia.org/T338042) [11:38:03] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:40:43] (03CR) 10Vgutierrez: [C: 03+1] service::catalog: Add kubestagemaster service (#1) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934552 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [11:42:50] (03PS1) 10Muehlenhoff: Remove expiry data/contact for fnavas-foundation [puppet] - 10https://gerrit.wikimedia.org/r/935033 [11:43:13] (03PS7) 10Jbond: sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 [11:43:26] PROBLEM - Check systemd state on puppetboard2003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetboard.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:53] (03CR) 10Vgutierrez: [C: 03+1] Add profile::lvs::realserver to kubestagemaster (#2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [11:46:54] (03CR) 10Volans: [C: 03+1] "LGTM, feel free to test it or merge it" [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [11:47:17] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10LSobanski) @jbond could you confirm that this is still a valid request? If yes, do you think there is a better match for it than #infrastru... [11:48:58] (03CR) 10Muehlenhoff: [C: 03+2] Remove expiry data/contact for fnavas-foundation [puppet] - 10https://gerrit.wikimedia.org/r/935033 (owner: 10Muehlenhoff) [11:50:09] I plan to deploy cxserver/MinT. [11:51:01] (03PS1) 10Effie Mouzeli: Convert kubestagemaster from CNAME to A record [dns] - 10https://gerrit.wikimedia.org/r/935035 (https://phabricator.wikimedia.org/T329827) [11:51:33] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-07-03-045311-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934720 (https://phabricator.wikimedia.org/T285217) (owner: 10KartikMistry) [11:51:41] (03PS6) 10Effie Mouzeli: Add profile::lvs::realserver to kubestagemaster (#2) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) [11:51:41] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10phaultfinder) [11:51:43] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340961 (10phaultfinder) [11:51:45] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340962 (10phaultfinder) [11:51:47] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340963 (10phaultfinder) [11:51:49] 10ops-codfw: 
ManagementSSHDown - https://phabricator.wikimedia.org/T340964 (phaultfinder)
[11:51:51] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340966 (phaultfinder)
[11:51:53] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340965 (phaultfinder)
[11:51:56] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340967 (phaultfinder)
[11:51:58] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340968 (phaultfinder)
[11:52:00] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340969 (phaultfinder)
[11:52:04] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340970 (phaultfinder)
[11:52:06] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340971 (phaultfinder)
[11:52:22] (Merged) jenkins-bot: Update cxserver to 2023-07-03-045311-production [deployment-charts] - https://gerrit.wikimedia.org/r/934720 (https://phabricator.wikimedia.org/T285217) (owner: KartikMistry)
[11:52:47] (PS1) Muehlenhoff: Remove LDAP access for s-mukuti [puppet] - https://gerrit.wikimedia.org/r/935036
[11:54:36] (CR) Muehlenhoff: [C: +2] Remove LDAP access for s-mukuti [puppet] - https://gerrit.wikimedia.org/r/935036 (owner: Muehlenhoff)
[11:55:04] (PS2) Effie Mouzeli: Convert kubestagemaster from CNAME to A record (#4) [dns] - https://gerrit.wikimedia.org/r/935035 (https://phabricator.wikimedia.org/T329827)
[11:57:46] (PS1) Muehlenhoff: Remove LDAP access for kchapman [puppet] - https://gerrit.wikimedia.org/r/935038
[11:58:19] (CR) CI reject: [V: -1] SpecialLog: Fix issues related to IP users [core] (wmf/1.41.0-wmf.15) - https://gerrit.wikimedia.org/r/934624 (https://phabricator.wikimedia.org/T338042) (owner: Func)
[11:58:52] (CR) Muehlenhoff: [C: +2] Remove LDAP access for kchapman [puppet] - https://gerrit.wikimedia.org/r/935038 (owner: Muehlenhoff)
[11:59:06] (CR) Func: "recheck" [core]
(wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934624 (https://phabricator.wikimedia.org/T338042) (owner: 10Func) [12:00:23] (03PS1) 10Arturo Borrero Gonzalez: dynamicproxy: api: install newer version of python3-flask-sqlalchemy [puppet] - 10https://gerrit.wikimedia.org/r/935039 (https://phabricator.wikimedia.org/T340881) [12:00:45] (03PS1) 10Muehlenhoff: Remove access for jminor [puppet] - 10https://gerrit.wikimedia.org/r/935040 [12:01:04] Is https://phabricator.wikimedia.org/T337405 change applied in all services, good to go? I need to update cxserver/MinT [12:01:33] akosiaris: ^ [12:02:03] 10Puppet, 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10jbond) [12:02:58] kart_: ? [12:03:01] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for jminor [puppet] - 10https://gerrit.wikimedia.org/r/935040 (owner: 10Muehlenhoff) [12:03:35] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging JMinor out of all services on: 20 hosts [12:03:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging JMinor out of all services on: 20 hosts [12:03:47] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging JMinor out of all services on: 1271 hosts [12:04:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging JMinor out of all services on: 1271 hosts [12:04:24] 10Puppet, 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10jbond) @LSobanski i have added observability alerting. Im not real sure if that's the best group but it is abo... [12:04:35] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging JMinor out of all services on: 760 hosts [12:04:39] akosiaris: see undeployed change in cxserver for example. 
[12:04:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging JMinor out of all services on: 760 hosts [12:04:52] It was part of T337405 [12:04:52] T337405: Refactor envoy.filters.http.router and envoy.filters.listener.tls_inspector - https://phabricator.wikimedia.org/T337405 [12:05:03] (03CR) 10Jbond: [C: 03+2] "tested and working, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [12:05:40] akosiaris: If it is OK to go, I would like to deploy cxserver and MinT :) [12:05:56] I think the question is for jayme then [12:06:22] OK. I'll hold on. [12:07:16] kart_: yes, please go ahead! [12:08:01] (03Merged) 10jenkins-bot: sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [12:08:22] jayme: Thanks! [12:08:47] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:08:47] (03PS1) 10Muehlenhoff: Remove access for awjrichards [puppet] - 10https://gerrit.wikimedia.org/r/935041 [12:09:09] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:10:33] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for awjrichards [puppet] - 10https://gerrit.wikimedia.org/r/935041 (owner: 10Muehlenhoff) [12:11:44] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Awjrichards out of all services on: 760 hosts [12:11:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Awjrichards out of all services on: 760 hosts [12:12:04] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:12:07] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Awjrichards out of all services on: 20 hosts [12:12:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Awjrichards out of all services on: 20 hosts [12:12:19] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Awjrichards out of all services on: 1271 hosts [12:12:39] 
!log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:12:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Awjrichards out of all services on: 1271 hosts [12:17:43] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:18:21] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:20:54] (03PS1) 10Vivian Rook: More API servers for Magnum [puppet] - 10https://gerrit.wikimedia.org/r/935046 (https://phabricator.wikimedia.org/T340980) [12:22:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1009.eqiad.wmnet [12:22:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] More API servers for Magnum [puppet] - 10https://gerrit.wikimedia.org/r/935046 (https://phabricator.wikimedia.org/T340980) (owner: 10Vivian Rook) [12:23:20] !log Updated cxserver to 2023-07-03-045311-production (T285217) [12:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:23] T285217: TypeError: textblock.getHtml is not a function - https://phabricator.wikimedia.org/T285217 [12:25:05] (03PS3) 10KartikMistry: Update MinT to 2023-06-29-061037-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933698 (https://phabricator.wikimedia.org/T340709) [12:26:11] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-06-29-061037-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933698 (https://phabricator.wikimedia.org/T340709) (owner: 10KartikMistry) [12:27:12] (03Merged) 10jenkins-bot: Update MinT to 2023-06-29-061037-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933698 (https://phabricator.wikimedia.org/T340709) (owner: 10KartikMistry) [12:27:17] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10fgiunchedi) [12:27:19] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340961 (10fgiunchedi) [12:27:21] 10ops-codfw: 
ManagementSSHDown - https://phabricator.wikimedia.org/T340962 (10fgiunchedi) [12:27:23] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340963 (10fgiunchedi) [12:27:25] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340965 (10fgiunchedi) [12:27:27] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340964 (10fgiunchedi) [12:27:39] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10fgiunchedi) [12:27:41] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340968 (10fgiunchedi) [12:27:43] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340966 (10fgiunchedi) [12:27:45] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340967 (10fgiunchedi) [12:27:47] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340969 (10fgiunchedi) [12:27:49] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340970 (10fgiunchedi) [12:27:53] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340971 (10fgiunchedi) [12:27:55] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10fgiunchedi) [12:29:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet [12:29:29] (03PS1) 10Arturo Borrero Gonzalez: apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 [12:29:55] (03CR) 10CI reject: [V: 04-1] apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [12:30:13] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10fgiunchedi) I merged all duplicates, there were timeouts from phalerts talking to the phab api: ` Jul 03 11:51:11 alert1001 phalerts[29762]: 2023-07-03 11:51:11,937 INFO: Looking for tasks with title='ManagementSSHDown' in ['PHID-PROJ... 
[12:31:21] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [12:33:32] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [12:34:02] (03PS2) 10Arturo Borrero Gonzalez: apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 [12:34:28] (03CR) 10Arturo Borrero Gonzalez: "If you like this idea I can work on the rspec tests." [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [12:35:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet [12:35:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1009.eqiad.wmnet [12:37:26] (03CR) 10Jon Harald Søby: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935048 (https://phabricator.wikimedia.org/T340981) (owner: 10Jon Harald Søby) [12:38:37] (03PS1) 10Marostegui: dbproxy1025: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/935049 [12:39:31] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) Thanks for taking a look at this I ended up creating a [[ https://gist.github.com/b4ldr/6822facfe4454c9bf6... 
[12:39:35] (03CR) 10Marostegui: [C: 03+2] dbproxy1025: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/935049 (owner: 10Marostegui) [12:41:16] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [12:41:31] (03CR) 10Muehlenhoff: apt: add package_from_bpo define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [12:42:04] (03PS1) 10Jgiannelos: wikifeeds: Add CSP headers for restbase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) [12:42:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1010.eqiad.wmnet [12:42:45] (03CR) 10Jgiannelos: "This patch adds as default CSP headers in wikifeeds level what restbase serves." [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos) [12:42:58] (03PS1) 10Btullis: Bump the datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/935052 (https://phabricator.wikimedia.org/T329514) [12:46:16] (03PS1) 10Btullis: Deploy the mediawiki hisory snapshot for June 2023 to AQS [puppet] - 10https://gerrit.wikimedia.org/r/935053 [12:46:48] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [12:46:56] (03CR) 10Btullis: [C: 03+2] Bump the datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/935052 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:47:16] (03CR) 10Btullis: [C: 03+2] Deploy the mediawiki hisory snapshot for June 2023 to AQS [puppet] - 10https://gerrit.wikimedia.org/r/935053 (owner: 10Btullis) [12:48:31] (03Merged) 10jenkins-bot: Bump the datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/935052 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:50:25] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main 
[12:51:03] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [12:53:12] (03CR) 10Vgutierrez: [C: 03+1] Convert kubestagemaster from CNAME to A record (#4) [dns] - 10https://gerrit.wikimedia.org/r/935035 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [12:57:02] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [12:59:14] !log Updated MinT to 2023-06-29-061037-production (T340709 + Fixed repeatation with Santali) [12:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:17] T340709: MinT adds extra space while translating fullstop to arabic full stop - https://phabricator.wikimedia.org/T340709 [12:59:40] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Lucy_Patterson_WMDE) thank you!! [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T1300). [13:00:05] Func and Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:14] o/ [13:00:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1010.eqiad.wmnet [13:00:17] present [13:00:24] I can’t deploy today, sorry [13:00:25] I can deploy today [13:00:33] hi Func / Jhs [13:00:37] o/ [13:00:40] ahoy [13:00:45] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:00:51] (03CR) 10Urbanecm: [C: 03+2] SpecialLog: Fix issues related to IP users [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934624 (https://phabricator.wikimedia.org/T338042) (owner: 10Func) [13:00:55] (03CR) 10Urbanecm: [C: 03+2] Set wgCollectionDisableSidebarLink for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935048 (https://phabricator.wikimedia.org/T340981) (owner: 10Jon Harald Søby) [13:02:06] (03Merged) 10jenkins-bot: Set wgCollectionDisableSidebarLink for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935048 (https://phabricator.wikimedia.org/T340981) (owner: 10Jon Harald Søby) [13:02:47] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:935048|Set wgCollectionDisableSidebarLink for nowiki (T340981)]] [13:02:49] 10SRE, 10Traffic: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 (10Vgutierrez) [13:02:50] T340981: Remove "Create a book" link from the Norwegian Bokmål Wikipedia - https://phabricator.wikimedia.org/T340981 [13:03:04] 10SRE, 10Traffic: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 (10Vgutierrez) p:05Triage→03Medium [13:04:15] !log urbanecm@deploy1002 jhsoby and urbanecm: Backport for [[gerrit:935048|Set wgCollectionDisableSidebarLink for nowiki (T340981)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:04:32] Jhs: can you test your patch at mwdebug1001? 
[13:05:02] urbanecm, confirmed, works as it should [13:05:08] awesome, proceeding [13:05:42] hah, when testing with random page i hit one of the first articles i created back in March 2005 :D [13:06:27] that reminds me of the Wikipediholism test [13:06:51] "When you click "Random Page" do you, more often than not, find an article you have written? (5)" [13:07:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1010.eqiad.wmnet [13:07:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1010.eqiad.wmnet [13:07:27] urbanecm, :D [13:08:00] By the way, anyone can help to run cleanupEmptyCategories.php on idwiki for T336780? It's necessary to fix a production error. [13:08:01] T336780: Wikimedia\Assert\ParameterAssertionException: Bad value for parameter $title: should not be empty unless namespace is main - https://phabricator.wikimedia.org/T336780 [13:08:09] Func: sure [13:08:31] (03PS1) 10Jbond: P:pki::client: add ability to concat the CA file with agent certs [puppet] - 10https://gerrit.wikimedia.org/r/935059 (https://phabricator.wikimedia.org/T340557) [13:08:33] (03PS1) 10Jbond: sretest1001: enable profile::pki::client::mutual_tls_add_puppet_ca [puppet] - 10https://gerrit.wikimedia.org/r/935060 (https://phabricator.wikimedia.org/T340557) [13:09:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42177/console" [puppet] - 10https://gerrit.wikimedia.org/r/935060 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [13:09:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1011.eqiad.wmnet [13:09:59] (03CR) 10CI reject: [V: 04-1] P:pki::client: add ability to concat the CA file with agent certs [puppet] - 10https://gerrit.wikimedia.org/r/935059 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [13:10:57] !log urbanecm@deploy1002 
Finished scap: Backport for [[gerrit:935048|Set wgCollectionDisableSidebarLink for nowiki (T340981)]] (duration: 08m 09s) [13:10:59] T340981: Remove "Create a book" link from the Norwegian Bokmål Wikipedia - https://phabricator.wikimedia.org/T340981 [13:11:12] Jhs: your patch is deployed. anything else for today? [13:11:35] urbanecm, nah, that's it. unless you wanna create a couple wikis ;) [13:12:05] that's not a task for today i'm afraid :)) [13:13:56] Func: this is the current output, the wrong category entry seems to still be present. am i missing something? https://www.irccloud.com/pastebin/EzkgtYRF/ [13:14:35] hum, `The category named is not valid` [13:14:57] well, an empty name is certainly not valid for a category. [13:15:35] maybe need a manual delete query... [13:16:36] (03Merged) 10jenkins-bot: SpecialLog: Fix issues related to IP users [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934624 (https://phabricator.wikimedia.org/T338042) (owner: 10Func) [13:18:48] (03CR) 10Effie Mouzeli: [C: 03+2] service::catalog: Add kubestagemaster service (#1) [puppet] - 10https://gerrit.wikimedia.org/r/934552 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [13:18:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1011.eqiad.wmnet [13:18:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1011.eqiad.wmnet [13:19:02] (03PS2) 10Jbond: sretest1001: enable profile::pki::client::mutual_tls_add_puppet_ca [puppet] - 10https://gerrit.wikimedia.org/r/935060 (https://phabricator.wikimedia.org/T340557) [13:19:03] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:934624|SpecialLog: Fix issues related to IP users (T338042 T340929)]] [13:19:07] T338042: Special:Log should show error when the input is invalid - https://phabricator.wikimedia.org/T338042 [13:19:07] T340929: IP range block log not showing when the target is prefixed with 
User: - https://phabricator.wikimedia.org/T340929 [13:19:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42178/console" [puppet] - 10https://gerrit.wikimedia.org/r/935060 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [13:20:25] !log urbanecm@deploy1002 func and urbanecm: Backport for [[gerrit:934624|SpecialLog: Fix issues related to IP users (T338042 T340929)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:20:34] Func: your patch is at mwdebug1001, please test. [13:21:38] !log Run `wikiadmin2023@10.64.16.184(idwiki)> DELETE FROM `category` WHERE cat_title = ''; ` (T336780) [13:21:40] urbanecm: looks good [13:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:41] T336780: Wikimedia\Assert\ParameterAssertionException: Bad value for parameter $title: should not be empty unless namespace is main - https://phabricator.wikimedia.org/T336780 [13:21:46] and i deleted the category entry too [13:21:52] thanks [13:21:57] anything else? [13:22:13] all done [13:22:19] great. proceeding with the deployment. 
[13:22:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1011.eqiad.wmnet [13:22:28] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:22:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1011.eqiad.wmnet [13:22:31] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:22:36] (03PS3) 10Jbond: sretest1001: enable profile::pki::client::mutual_tls_add_puppet_ca [puppet] - 10https://gerrit.wikimedia.org/r/935060 (https://phabricator.wikimedia.org/T340557) [13:24:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42179/console" [puppet] - 10https://gerrit.wikimedia.org/r/935060 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [13:25:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet [13:26:15] (03PS3) 10Arturo Borrero Gonzalez: apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 [13:27:07] (03CR) 10JMeybohm: [C: 03+1] Add profile::lvs::realserver to kubestagemaster (#2) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [13:27:35] (03CR) 10JMeybohm: [C: 03+1] hieradata: add control plane nodes for kubestagemaster (#3) [puppet] - 10https://gerrit.wikimedia.org/r/934558 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [13:27:35] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:934624|SpecialLog: Fix issues related to IP users (T338042 T340929)]] (duration: 08m 32s) [13:27:40] T338042: Special:Log should show error when the input is invalid - https://phabricator.wikimedia.org/T338042 [13:27:40] T340929: IP range block log not showing when the target is prefixed with 
User: - https://phabricator.wikimedia.org/T340929 [13:27:42] Func: and, deployed. [13:27:49] so, it seems we're done here? [13:27:55] yes [13:28:09] !log UTC afternoon B&C window done [13:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:14] (03PS7) 10Effie Mouzeli: Add profile::lvs::realserver to kubestagemaster (#2) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) [13:28:17] (03CR) 10CI reject: [V: 04-1] apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [13:28:45] (03PS5) 10Effie Mouzeli: hieradata: add control plane nodes for kubestagemaster (#3) [puppet] - 10https://gerrit.wikimedia.org/r/934558 (https://phabricator.wikimedia.org/T329827) [13:30:57] (03CR) 10Jbond: "general idea sgtm, left a few comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [13:31:17] (03CR) 10Effie Mouzeli: [C: 03+2] Add profile::lvs::realserver to kubestagemaster (#2) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [13:32:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet [13:32:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1011.eqiad.wmnet [13:34:12] (03PS6) 10Effie Mouzeli: hieradata: add control plane nodes for kubestagemaster (#3) [puppet] - 10https://gerrit.wikimedia.org/r/934558 (https://phabricator.wikimedia.org/T329827) [13:34:14] (03PS1) 10Effie Mouzeli: service::catalog: Switch kubestagemaster service to production (#5) [puppet] - 10https://gerrit.wikimedia.org/r/935064 (https://phabricator.wikimedia.org/T329827) [13:35:50] (03PS2) 10Jbond: P:pki::client: add ability to concat the CA file with agent certs [puppet] - 10https://gerrit.wikimedia.org/r/935059 
(https://phabricator.wikimedia.org/T340557) [13:35:51] (03PS4) 10Jbond: sretest1001: enable profile::pki::client::mutual_tls_add_puppet_ca [puppet] - 10https://gerrit.wikimedia.org/r/935060 (https://phabricator.wikimedia.org/T340557) [13:37:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42181/console" [puppet] - 10https://gerrit.wikimedia.org/r/935059 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [13:37:15] !log installing openjdk-8 security updates [13:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:30] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: add ability to concat the CA file with agent certs [puppet] - 10https://gerrit.wikimedia.org/r/935059 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [13:39:34] (03CR) 10Jbond: [C: 03+2] sretest1001: enable profile::pki::client::mutual_tls_add_puppet_ca [puppet] - 10https://gerrit.wikimedia.org/r/935060 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [13:39:48] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) Worth noting here that we are now expiring objects that have an expiry header set (work was T229584). 
[13:40:45] (03PS2) 10Kosta Harlan: Assign 'edit' right to the 'temp' group in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933974 (owner: 10Tchanders) [13:41:26] (03CR) 10Effie Mouzeli: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/output/934557/42182/" [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [13:41:31] (03CR) 10Kosta Harlan: Assign 'edit' right to the 'temp' group in dewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933974 (owner: 10Tchanders) [13:42:03] (03CR) 10Kosta Harlan: [C: 03+1] "since this is for beta labs, we can +2 this whenever we think it's ready to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933974 (owner: 10Tchanders) [13:43:30] (03PS1) 10Jbond: puppetdb1003: enable profile::pki::client::mutual_tls_add_puppet_ca [puppet] - 10https://gerrit.wikimedia.org/r/935066 (https://phabricator.wikimedia.org/T340557) [13:45:20] (03CR) 10Jbond: [C: 03+2] puppetdb1003: enable profile::pki::client::mutual_tls_add_puppet_ca [puppet] - 10https://gerrit.wikimedia.org/r/935066 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [13:49:27] (03CR) 10Muehlenhoff: apt: add package_from_bpo define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [13:50:07] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubestagemaster2002.codfw.wmnet with OS bullseye [13:56:38] (03Abandoned) 10JMeybohm: Don't restart systemd service on upgrade [debs/envoyproxy] (v1.23) - 10https://gerrit.wikimedia.org/r/935030 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [13:58:41] (03PS1) 10Jbond: P:pki::client: make mutual_tls_add_puppet_ca the default behaviour [puppet] - 10https://gerrit.wikimedia.org/r/935070 [13:58:44] (03PS7) 10Effie Mouzeli: hieradata: add control plane nodes for kubestagemaster (#3) [puppet] - 10https://gerrit.wikimedia.org/r/934558 
(https://phabricator.wikimedia.org/T329827) [13:59:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42183/console" [puppet] - 10https://gerrit.wikimedia.org/r/935070 (owner: 10Jbond) [13:59:19] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Creat cookbook to migrate serveres from the puppetmasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10SLyngshede-WMF) [13:59:33] (03PS1) 10Btullis: Add an apt mirror for the confluent-kafka 7.4 release [puppet] - 10https://gerrit.wikimedia.org/r/935071 (https://phabricator.wikimedia.org/T329514) [14:00:44] (03CR) 10CI reject: [V: 04-1] P:pki::client: make mutual_tls_add_puppet_ca the default behaviour [puppet] - 10https://gerrit.wikimedia.org/r/935070 (owner: 10Jbond) [14:01:40] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2002.codfw.wmnet with reason: host reimage [14:03:22] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: add control plane nodes for kubestagemaster (#3) [puppet] - 10https://gerrit.wikimedia.org/r/934558 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [14:04:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2002.codfw.wmnet with reason: host reimage [14:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:02] (03PS1) 10JMeybohm: envoy: Promote 1.23.10 from envoy-future to envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935073 (https://phabricator.wikimedia.org/T300324) [14:08:38] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy: Promote 1.23.10 from envoy-future to envoy [docker-images/production-images] 
- 10https://gerrit.wikimedia.org/r/935073 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [14:12:27] (03PS1) 10JMeybohm: deployment_server::general: bump default envoy version to 1.23.10 [puppet] - 10https://gerrit.wikimedia.org/r/935074 (https://phabricator.wikimedia.org/T300324) [14:13:06] 10SRE-swift-storage, 10Commons: Uploading large files to Commons almost always fails - https://phabricator.wikimedia.org/T340901 (10Yann) See T340917. I can't upload files bigger than 350 MB on the English Wikisource. Systematic fail. [14:13:11] RECOVERY - Check whether ferm is active by checking the default input chain on kubestagemaster1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:16:54] (03PS1) 10Clément Goubert: team-sre: Add warning for CentralAuth job lag [alerts] - 10https://gerrit.wikimedia.org/r/935078 (https://phabricator.wikimedia.org/T336627) [14:17:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:23:08] 10SRE, 10Traffic: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 (10Fabfur) Just as reminder: As agreed with @Vgutierrez we decided to split the current haproxy acls/other actions per frontend in hieradata, eg.: ` profile::cache::haproxy::acls: tls:... [14:24:41] RECOVERY - Check systemd state on kubestagemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:59] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2002.codfw.wmnet with OS bullseye [14:26:55] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10LDAP: Should puppet auto-restart slapd? 
- https://phabricator.wikimedia.org/T171191 (10jbond) i see the following in the puppet manifest so i think this has been ficxed in the mean time File['/etc/ldap/slapd.conf'] ~> Service['slapd'] [14:27:29] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10LDAP: Should puppet auto-restart slapd? - https://phabricator.wikimedia.org/T171191 (10jbond) 05Open→03Resolved a:03jbond [14:28:51] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Seen), 10User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10joanna_borun) [14:29:33] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411 (10joanna_borun) [14:30:29] 10Puppet, 10Data-Engineering-Icebox, 10observability, 10User-Elukey: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10joanna_borun) [14:31:57] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10jbond) [14:32:59] (03PS3) 10Tchanders: Assign 'edit' right to the 'temp' group in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933974 (https://phabricator.wikimedia.org/T340457) [14:33:03] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10jbond) [14:33:05] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review: Switch puppetdb to profile::java - https://phabricator.wikimedia.org/T264178 (10jbond) 05Open→03Resolved a:03jbond closing this as it looks completed but please reopen if i missed something [14:33:35] (03CR) 10Tchanders: Assign 'edit' right to the 'temp' group in dewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933974 
(https://phabricator.wikimedia.org/T340457) (owner: 10Tchanders) [14:34:05] (03CR) 10Elukey: "A couple of questions:" [puppet] - 10https://gerrit.wikimedia.org/r/935071 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:39:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651 (10joanna_borun) [14:41:48] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: removing admin::groups from hiera doesn't revoke permissions - https://phabricator.wikimedia.org/T89961 (10jbond) 05Open→03Declined declining based on last comments [14:42:23] (03PS1) 10Samtar: IS-labs: Remove wmgWikibaseClientEchoIcon config for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935087 (https://phabricator.wikimedia.org/T296712) [14:42:36] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Lucy_Patterson_WMDE) thank you!! [14:43:52] (03CR) 10Tchanders: Assign 'edit' right to the 'temp' group in dewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933974 (https://phabricator.wikimedia.org/T340457) (owner: 10Tchanders) [14:44:04] jouncebot: nowandnext [14:44:04] No deployments scheduled for the next 0 hour(s) and 45 minute(s) [14:44:04] In 0 hour(s) and 45 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T1530) [14:45:11] (going to quickly deploy a prod no-op https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/935087/) [14:45:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935087 (https://phabricator.wikimedia.org/T296712) (owner: 10Samtar) [14:46:14] (03Merged) 10jenkins-bot: IS-labs: Remove wmgWikibaseClientEchoIcon config for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935087 (https://phabricator.wikimedia.org/T296712) 
(owner: 10Samtar) [14:46:58] (done) [14:50:33] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review: Switch puppetdb to profile::java - https://phabricator.wikimedia.org/T264178 (10MoritzMuehlenhoff) It's currently only applied to the puppet 7 puppetdbs, but not the legacy ones. But +1 on resolving, given that we'll retire t... [14:50:53] Wikibase mentioned :O [14:50:59] (jk ^^) [14:51:06] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Everything using Java is using profile::java by now. [14:51:35] :p was only T296712 bugging me every time I look at beta logstash [14:51:36] T296712: MWException: File '/srv/mediawiki/php-master/extensions//static/images/wikibase/echoIcon.svg' does not exist - https://phabricator.wikimedia.org/T296712 [14:52:41] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: using the include function can trigger false positives with puppet-lint-wmf_styleguide - https://phabricator.wikimedia.org/T275387 (10jbond) [14:52:47] oh, I didn’t notice the value changed [14:52:52] 'url' instead of 'path'? 
[14:53:07] I thought you were just removing a redundant setting that was identical in IS(-labs) ^^ [14:53:12] thanks anyway :) [14:53:19] (03PS1) 10Clément Goubert: changeprop: Change normal_rule_processing_delay to histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 [14:53:55] (03PS2) 10Effie Mouzeli: service::catalog: Switch kubestagemaster service to lvs_setup (#4) [puppet] - 10https://gerrit.wikimedia.org/r/935064 (https://phabricator.wikimedia.org/T329827) [14:53:55] * TheresNoTime did just remove it [14:55:14] (03CR) 10JMeybohm: [C: 03+1] Convert kubestagemaster from CNAME to A record (#4) [dns] - 10https://gerrit.wikimedia.org/r/935035 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [14:59:29] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [14:59:43] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: puppetmaster - ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB - https://phabricator.wikimedia.org/T255667 (10jbond) 05Open→03Declined This is a by product of having binary objects sent in the catalogue. 
[15:01:34] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (10jbond) [15:03:44] (03PS3) 10Effie Mouzeli: service::catalog: Switch kubestagemaster service to lvs_setup (#4) [puppet] - 10https://gerrit.wikimedia.org/r/935064 (https://phabricator.wikimedia.org/T329827) [15:03:46] (03PS1) 10Effie Mouzeli: service::catalog: Switch kubestagemaster service to production (#5) [puppet] - 10https://gerrit.wikimedia.org/r/935090 (https://phabricator.wikimedia.org/T329827) [15:03:57] (03PS3) 10Effie Mouzeli: Convert kubestagemaster from CNAME to A record (#6) [dns] - 10https://gerrit.wikimedia.org/r/935035 (https://phabricator.wikimedia.org/T329827) [15:06:54] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42184/console" [puppet] - 10https://gerrit.wikimedia.org/r/935064 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [15:08:52] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] "looks good, please adjust the weight to non-zero:" [puppet] - 10https://gerrit.wikimedia.org/r/935064 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [15:09:38] !log installing Java 8 security updates on Hadoop systems [15:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:44] (03CR) 10Filippo Giunchedi: "Seems sensible to me, this will affect https://gerrit.wikimedia.org/r/c/operations/alerts/+/935078/1/team-sre/jobqueue.yaml (will cross-c" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 (owner: 10Clément Goubert) [15:09:49] (03CR) 10Filippo Giunchedi: [C: 03+1] changeprop: Change normal_rule_processing_delay to histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 (owner: 10Clément Goubert) [15:10:09] (03CR) 10Filippo Giunchedi: "Will need adjustment after 
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/935089" [alerts] - 10https://gerrit.wikimedia.org/r/935078 (https://phabricator.wikimedia.org/T336627) (owner: 10Clément Goubert) [15:12:42] !log jiji@cumin1001 conftool action : set/weight=10; selector: name=kubestagemaster1001.eqiad.wmnet [15:12:47] !log jiji@cumin1001 conftool action : set/weight=10; selector: name=kubestagemaster1002.eqiad.wmnet [15:14:01] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=codfw,cluster=kubernetes-staging,service=kubemaster [15:15:19] (03PS1) 10Majavah: P:toolforge: mailrelay: reject outbound emails without a sender [puppet] - 10https://gerrit.wikimedia.org/r/935093 (https://phabricator.wikimedia.org/T337259) [15:16:06] (03PS2) 10Majavah: ssh: support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/928797 (https://phabricator.wikimedia.org/T337241) [15:16:10] 10Puppet, 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10joanna_borun) It's going to be fixed with puppet 7 upgrade. [15:16:34] (03CR) 10Raymond Ndibe: "Thanks David for the work done on this! 
I will attempt to follow your logic and comment on anything that is confusing" [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [15:16:42] 10Puppet, 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10joanna_borun) 05Open→03Declined [15:17:31] (03CR) 10Btullis: Add an apt mirror for the confluent-kafka 7.4 release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935071 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:17:43] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42185/console" [puppet] - 10https://gerrit.wikimedia.org/r/928797 (https://phabricator.wikimedia.org/T337241) (owner: 10Majavah) [15:17:54] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (10jbond) [15:18:23] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) 05Open→03Resolved most of the useful functions and types have been upstream and we will try to include the sty... 
[15:20:46] (03PS1) 10Effie Mouzeli: conftool: Add kubestagemasters 1002 and 2002 (#4) [puppet] - 10https://gerrit.wikimedia.org/r/935094 (https://phabricator.wikimedia.org/T329827) [15:21:24] (03PS4) 10Effie Mouzeli: service::catalog: Switch kubestagemaster service to lvs_setup (#5) [puppet] - 10https://gerrit.wikimedia.org/r/935064 (https://phabricator.wikimedia.org/T329827) [15:21:38] (03PS2) 10Effie Mouzeli: service::catalog: Switch kubestagemaster service to production (#6) [puppet] - 10https://gerrit.wikimedia.org/r/935090 (https://phabricator.wikimedia.org/T329827) [15:21:54] (03PS4) 10Effie Mouzeli: Convert kubestagemaster from CNAME to A record (#7) [dns] - 10https://gerrit.wikimedia.org/r/935035 (https://phabricator.wikimedia.org/T329827) [15:22:44] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) [15:22:51] (03CR) 10Majavah: "I still need to test this." [puppet] - 10https://gerrit.wikimedia.org/r/935093 (https://phabricator.wikimedia.org/T337259) (owner: 10Majavah) [15:23:05] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10User-jbond: missing CRL - https://phabricator.wikimedia.org/T235185 (10jbond) 05Open→03Declined This will be different with the puppetserveres and not worth fixing for the puppetmasteres [15:23:46] (03PS1) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [15:24:56] (03CR) 10Btullis: Add an apt mirror for the confluent-kafka 7.4 release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935071 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:26:59] (03CR) 10Hnowlan: wikifeeds: Add CSP headers for restbase sunset (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos) [15:27:15] (03CR) 10Vgutierrez: [C: 03+1] conftool: Add 
kubestagemasters 1002 and 2002 (#4) [puppet] - 10https://gerrit.wikimedia.org/r/935094 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [15:27:17] (03CR) 10Btullis: Add an apt mirror for the confluent-kafka 7.4 release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935071 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:27:57] (03CR) 10Raymond Ndibe: replica_cnf_api: refactor to use multiple backends (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [15:28:09] (03CR) 10Effie Mouzeli: [C: 03+2] conftool: Add kubestagemasters 1002 and 2002 (#4) [puppet] - 10https://gerrit.wikimedia.org/r/935094 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [15:28:25] 10SRE, 10Puppet-Core: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10jbond) [15:29:31] (03CR) 10Majavah: [C: 03+1] "I sometimes wonder if we should apply Priority: 100 to the osbpo repos (at least on VMs) to prevent these kinds of surprises." [puppet] - 10https://gerrit.wikimedia.org/r/935039 (https://phabricator.wikimedia.org/T340881) (owner: 10Arturo Borrero Gonzalez) [15:30:04] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T1530). 
[15:31:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: api: install newer version of python3-flask-sqlalchemy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935039 (https://phabricator.wikimedia.org/T340881) (owner: 10Arturo Borrero Gonzalez) [15:32:10] (03CR) 10Majavah: [C: 03+1] dynamicproxy: api: install newer version of python3-flask-sqlalchemy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935039 (https://phabricator.wikimedia.org/T340881) (owner: 10Arturo Borrero Gonzalez) [15:33:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: api: install newer version of python3-flask-sqlalchemy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935039 (https://phabricator.wikimedia.org/T340881) (owner: 10Arturo Borrero Gonzalez) [15:34:00] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=kubestagemaster2002.codfw.wmnet [15:34:11] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=kubestagemaster1002.eqiad.wmnet [15:34:20] !log jiji@cumin1001 conftool action : set/weight=10; selector: name=kubestagemaster2002.codfw.wmnet [15:34:24] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Review puppetmaster SSL configuration - https://phabricator.wikimedia.org/T268040 (10jbond) 05Open→03Resolved a:03jbond This will all change in puppet7 [15:34:32] !log jiji@cumin1001 conftool action : set/weight=10; selector: name=kubestagemaster1002.eqiad.wmnet [15:34:34] (03CR) 10Majavah: [C: 03+1] dynamicproxy: api: install newer version of python3-flask-sqlalchemy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935039 (https://phabricator.wikimedia.org/T340881) (owner: 10Arturo Borrero Gonzalez) [15:35:22] (03PS5) 10Effie Mouzeli: service::catalog: Switch kubestagemaster service to lvs_setup (#5) [puppet] - 10https://gerrit.wikimedia.org/r/935064 (https://phabricator.wikimedia.org/T329827) [15:36:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: 
Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (10jbond) [15:36:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: api: install newer version of python3-flask-sqlalchemy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935039 (https://phabricator.wikimedia.org/T340881) (owner: 10Arturo Borrero Gonzalez) [15:36:27] (03CR) 10Raymond Ndibe: replica_cnf_api: refactor to use multiple backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [15:38:34] (03CR) 10Raymond Ndibe: replica_cnf_api: refactor to use multiple backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [15:39:58] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [15:40:03] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (10jbond) [15:40:30] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [15:43:38] (03CR) 10Raymond Ndibe: replica_cnf_api: refactor to use multiple backends (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [15:43:57] (03CR) 10Effie Mouzeli: [C: 03+2] service::catalog: Switch kubestagemaster service to lvs_setup (#5) [puppet] - 10https://gerrit.wikimedia.org/r/935064 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [15:46:09] (03PS2) 10Jbond: P:pki::client: make mutual_tls_add_puppet_ca the default behaviour [puppet] - 10https://gerrit.wikimedia.org/r/935070 (https://phabricator.wikimedia.org/T340557) [15:47:00] (03PS4) 10Arturo Borrero 
Gonzalez: apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 [15:47:35] (03PS5) 10Arturo Borrero Gonzalez: apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 [15:47:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42187/console" [puppet] - 10https://gerrit.wikimedia.org/r/935070 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [15:48:34] (03PS1) 10JMeybohm: kubernetes::deployment_server: Globally enable envoy telemetry [puppet] - 10https://gerrit.wikimedia.org/r/935097 (https://phabricator.wikimedia.org/T300324) [15:48:43] 10Puppet, 10SRE, 10Infrastructure-Foundations: empty hiera yaml file makes lookup fail - https://phabricator.wikimedia.org/T89957 (10jbond) 05Open→03Resolved a:03jbond closing this, im guessing this has been fixed upstream in the mean time as we currently have [[ https://github.com/wikimedia/operations... 
[15:49:51] !log restarting pybal on lvs1020 [15:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:57] (03PS2) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [15:50:09] (03PS4) 10Majavah: hieradata: add cache_hosts for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/831044 [15:50:11] (03PS4) 10Majavah: P:mariadb::cloudinfra: add web proxy database/grants [puppet] - 10https://gerrit.wikimedia.org/r/831045 (https://phabricator.wikimedia.org/T316982) [15:50:13] (03PS3) 10Majavah: dynamicproxy: remove proxygetter [puppet] - 10https://gerrit.wikimedia.org/r/928457 [15:50:15] (03PS3) 10Majavah: dynamicproxy: move api files to api/ folder [puppet] - 10https://gerrit.wikimedia.org/r/928458 [15:50:17] (03PS3) 10Majavah: mariadb::config::client: allow configuring default database [puppet] - 10https://gerrit.wikimedia.org/r/928461 [15:50:19] (03PS4) 10Majavah: dynamicproxy: use a mariadb backend [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) [15:50:21] (03PS6) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [15:50:23] (03PS1) 10Majavah: dynamicproxy: api: fix apt pinning type [puppet] - 10https://gerrit.wikimedia.org/r/935099 [15:50:37] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) mw and restbase canaries as well as mathoid are running 1.23.10 since today. If nothing comes up I will roll the update out to the rest of the fleet tom... 
[15:51:59] > !log restarting pybal on lvs1019 [15:52:03] !log restarting pybal on lvs1019 [15:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:22] (03CR) 10Jbond: role::mediawiki::appserver: merge role::mediawiki::common in (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526290 (owner: 10Dzahn) [15:56:28] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10jbond) [15:57:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10jbond) [15:57:06] !log restarting pybal on lvs2014 [15:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:37] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10jbond) [15:58:22] (03PS3) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [15:59:13] PROBLEM - PyBal connections to etcd on lvs2014 is CRITICAL: CRITICAL: 95 connections established with conf2004.codfw.wmnet:4001 (min=96) https://wikitech.wikimedia.org/wiki/PyBal [15:59:29] effie: everything ok? 
:) [15:59:37] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [15:59:46] vgutierrez: yeah puppet was already running on lvs2013 [16:00:05] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10jbond) I have updated the description i believe the first point was to ensure we had no nodes like the following ` lang=p... [16:00:17] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.86:6443]) https://wikitech.wikimedia.org/wiki/PyBal [16:00:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: api: fix apt pinning type [puppet] - 10https://gerrit.wikimedia.org/r/935099 (owner: 10Majavah) [16:02:35] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.86:6443]) https://wikitech.wikimedia.org/wiki/PyBal [16:02:43] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1005 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [16:03:52] effie: lvs2013 got the puppet change as well [16:04:02] yeah just saw the puppet log [16:04:05] ok moveing fw [16:04:06] got it at 15:58:19 [16:04:17] > !log restarting pybal on lvs2014 [16:04:20] !log restarting pybal on lvs2014 [16:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:28] but something is off with facter on 
lvs2013 [16:04:30] Jul 3 15:58:56 lvs2013 puppet-agent[1421575]: Loading facts [16:04:35] that's quite slow :/ [16:04:45] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 77 connections established with conf2004.codfw.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [16:05:09] Jul 3 15:41:54 lvs2013 puppet-agent[1417206]: Loading facts [16:05:10] Jul 3 15:58:08 lvs2013 puppet-agent[1417206]: Caching catalog for lvs2013.codfw.wmnet [16:05:15] PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:17] 17 minutes loading facts? :_) [16:05:37] (03PS6) 10Arturo Borrero Gonzalez: apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 [16:05:49] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:06:32] the change was merged at 15:44 [16:06:56] effie: yeah.. at 15:58 it was already there [16:07:24] something else is going on on lvs2013 [16:07:24] catalog gets fetched after the facts are loaded [16:07:25] (03CR) 10MSantos: [C: 03+1] wikifeeds: Add CSP headers for restbase sunset (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos) [16:07:43] effie: btw, that's solved by disabling puppet on the LVS before merging :) [16:08:21] yep.. management interface seems toasted in lvs2013 [16:08:29] vgutierrez: I will add it on the docs then [16:08:43] vgutierrez: I have the pybal restart of 2013 left, shall I go on with it ? [16:09:00] or you'd like us to failover to 2014 ? 
[16:09:37] (03CR) 10Arturo Borrero Gonzalez: apt: add package_from_bpo define (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [16:09:39] effie: go ahead [16:09:48] cheers [16:09:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet does not undo manual "systemctl mask $unit" - https://phabricator.wikimedia.org/T285425 (10jbond) > I expected that if puppet manages a systemd unit, e.g. mediawiki::periodic::job, it will ensure the unit actually runs, I think we may need to add som... [16:09:56] (03PS4) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [16:10:01] !log restarting pybal on lvs2013 [16:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:04] need to ping dcops to handle lvs2013 [16:10:17] RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 96 connections established with conf2004.codfw.wmnet:4001 (min=96) https://wikitech.wikimedia.org/wiki/PyBal [16:10:17] (03CR) 10Muehlenhoff: [C: 03+1] "Can't comment on the wider Kafka quesions raised by Luca, but +1 on the reprepro side of things if we proceed with the 7.4 confluent packa" [puppet] - 10https://gerrit.wikimedia.org/r/935071 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [16:11:10] vgutierrez: on a brighter side, we found out that the host is problematic ! 
[16:11:34] (03CR) 10Fabfur: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42190/console" [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [16:11:51] effie: yup, that's https://phabricator.wikimedia.org/T340960 [16:12:53] 10ops-codfw, 10Traffic: ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10Vgutierrez) unresponsive management interface results in puppet being super slow loading facts: ` Jul 3 15:41:54 lvs2013 puppet-agent[1417206]: Loading facts Jul 3 15:58:08 lvs2013 puppet-agent[1417206]: Caching catalog... [16:13:29] (03CR) 10Fabfur: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42191/console" [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [16:13:39] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:13:58] fabfur: ^^ it looks like you're running pcc against the wrong CR [16:15:05] tnx [16:15:47] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 78 connections established with conf2004.codfw.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [16:18:28] (03PS5) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [16:20:28] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42193/console" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [16:21:35] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651 (10jbond) 05Open→03Declined >>! 
In T187651#4018556, @MoritzMuehlenhoff wrote: > I think puppet is right in overriding the local admin choice here. If the package i... [16:25:29] PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:28:54] (03PS1) 10Majavah: Uninstall Diamond everywhere [puppet] - 10https://gerrit.wikimedia.org/r/935103 (https://phabricator.wikimedia.org/T317032) [16:29:22] (03CR) 10Majavah: [C: 04-1] "Do not merge until July 17th - https://wikitech.wikimedia.org/wiki/News/2023_Cloud_VPS_metrics_changes" [puppet] - 10https://gerrit.wikimedia.org/r/935103 (https://phabricator.wikimedia.org/T317032) (owner: 10Majavah) [16:38:51] (03PS7) 10Jbond: apt: add package_from_bpo define [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [16:41:37] (03CR) 10Kosta Harlan: [C: 03+2] Assign 'edit' right to the 'temp' group in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933974 (https://phabricator.wikimedia.org/T340457) (owner: 10Tchanders) [16:42:35] (03Merged) 10jenkins-bot: Assign 'edit' right to the 'temp' group in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933974 (https://phabricator.wikimedia.org/T340457) (owner: 10Tchanders) [16:43:14] (03CR) 10Jbond: "LGTM but the notices still needs fixing" [puppet] - 10https://gerrit.wikimedia.org/r/935047 (owner: 10Arturo Borrero Gonzalez) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T1700) [17:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T1700). 
[17:11:27] (03CR) 10David Caro: replica_cnf_api: refactor to use multiple backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [18:17:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:21:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:26:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:47:21] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Ladsgroup) You're welcome ^_^ [18:58:44] (SystemdUnitCrashLoop) firing: crashloop on search-loader1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:03:44] (SystemdUnitCrashLoop) resolved: crashloop on search-loader1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:07:26] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: Bump the concurrency for parsoidCachePrewarm to 150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935152 (https://phabricator.wikimedia.org/T339867) (owner: 
10Clément Goubert) [19:08:06] (03CR) 10Clément Goubert: [C: 03+2] changeprop-jobqueue: Bump the concurrency for parsoidCachePrewarm to 150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935152 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert) [19:08:56] (03Merged) 10jenkins-bot: changeprop-jobqueue: Bump the concurrency for parsoidCachePrewarm to 150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935152 (https://phabricator.wikimedia.org/T339867) (owner: 10Clément Goubert) [19:09:21] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [19:09:45] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [19:09:52] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [19:10:39] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [19:11:02] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [19:11:45] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [19:12:58] (03PS1) 10Gmodena: mw-page-content-change-enrichment stream partition WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/935153 (https://phabricator.wikimedia.org/T338169) [19:16:49] (03CR) 10Raymond Ndibe: [C: 03+1] "other than the failing ci (which doesn't seem to be from this patch) and the few optional changes, everything else looks fine. All the tes" [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [19:17:39] (03CR) 10Gmodena: "Small patch to test the (wiki_id, page_id) partitioning of page_change." [deployment-charts] - 10https://gerrit.wikimedia.org/r/935153 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [19:20:16] (03CR) 10Raymond Ndibe: "@dcaro is there any reason we haven't yet merged this patch? 
everything seems ok to me or am I missing something?" [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [19:21:00] (03CR) 10Raymond Ndibe: "@dcaro is there any reason we haven't yet merged this patch? everything seems ok to me or am I missing something?" [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [19:29:14] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) (owner: 10Cwhite) [20:03:21] Jul 3 20:00:20 lvs1020 pybal[572493]: [swift_80] INFO: Server ms-fe1009.eqiad.wmnet (enabled/partially up/not pooled) is up [20:03:23] Jul 3 20:01:42 lvs1020 pybal[572493]: [swift-https_443 ProxyFetch] WARN: ms-fe1009.eqiad.wmnet (enabled/up/pooled): Fetch failed (https://localhost/monitoring/frontend), 5.001 s [20:14:17] !log jiji@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [20:15:40] !log restarting swift proxies [20:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:29] !log jiji@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [20:57:15] (03PS1) 10Krinkle: Add vendor submodule with deps for xhgui 0.12.0 [software/xhgui] (wmf_deploy) - 10https://gerrit.wikimedia.org/r/935168 (https://phabricator.wikimedia.org/T340713) [20:57:33] (03CR) 10Krinkle: [V: 03+2 C: 03+2] Add vendor submodule with deps for xhgui 0.12.0 [software/xhgui] (wmf_deploy) - 10https://gerrit.wikimedia.org/r/935168 (https://phabricator.wikimedia.org/T340713) (owner: 10Krinkle) [21:00:06] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each 
other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230703T2100). [21:09:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:17:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:29:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:31:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:34:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag