[00:27:11] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:33:58] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:37:11] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:38:56] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986836
[00:39:02] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986836 (owner: TrainBranchBot)
[00:46:20] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:26] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[00:52:31] !log restarted prometheus@k8s on prometheus1005 and backed up the wal for OOM loop investigation
[00:52:33]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:07] !log restarted prometheus@k8s on prometheus1006 and backed up the wal for OOM loop investigation
[00:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:13] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986836 (owner: TrainBranchBot)
[01:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:26:07] (PS14) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[01:26:11] (PS12) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[01:27:57] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=arzwiki --logwiki=metawiki 'WanderingPlaywrite' 'WanderingPlaywright' # T354397
[01:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:01] T354397: Rename stuck for three weeks - https://phabricator.wikimedia.org/T354397
[01:38:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:13:00] RECOVERY - cassandra-a CQL 10.192.48.234:9042 on restbase2034 is OK: TCP OK - 0.030 second response time on 10.192.48.234 port 9042 https://phabricator.wikimedia.org/T93886
[02:17:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:18:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:19:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:19:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:37:11] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:51:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:56:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int -
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:08:58] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:06] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:18:08] (CR) Andrea Denisse: [C: +1] admin: add wfan219 to deployment [puppet] - https://gerrit.wikimedia.org/r/985331 (https://phabricator.wikimedia.org/T353958) (owner: Herron)
[03:18:32] (KubernetesRsyslogDown) firing: (6) rsyslog on mw1378:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:20:08] SRE, SRE-Access-Requests: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (andrea.denisse) Open→In progress a:andrea.denisse
[03:42:50] (CR) Andrea Denisse: [C: +2] prometheus: Ensure prometheus-icinga has a listening address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/981407 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[03:48:01] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/987789 (owner: Filippo Giunchedi)
[03:48:53] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/984237 (https://phabricator.wikimedia.org/T353220) (owner: Cwhite)
[03:52:11] (PS1) Tim Starling: Increase socat 6to4 buffer size [puppet] - https://gerrit.wikimedia.org/r/987879 (https://phabricator.wikimedia.org/T353220)
[04:25:39] SRE, SRE-Access-Requests: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (andrea.denisse) Hi Arthur, I hope you're doing well. While anticipating your access request, I attempted to verify that the developer account belongs to an actual staff member, as...
[04:27:05] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (andrea.denisse)
[04:29:53] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (andrea.denisse) Open→In progress a:andrea.denisse
[04:31:25] (CR) Andrea Denisse: [C: +2] admin: add dreamyjazz to deployment [puppet] - https://gerrit.wikimedia.org/r/984630 (https://phabricator.wikimedia.org/T353735) (owner: Herron)
[04:38:42] (PS2) Andrea Denisse: admin: add dreamyjazz to deployment [puppet] - https://gerrit.wikimedia.org/r/984630 (https://phabricator.wikimedia.org/T353735) (owner: Herron)
[04:39:19] (CR) CI reject: [V: -1] admin: add dreamyjazz to deployment [puppet] - https://gerrit.wikimedia.org/r/984630 (https://phabricator.wikimedia.org/T353735) (owner: Herron)
[04:40:02] (PS3) Andrea Denisse: admin: add dreamyjazz to deployment [puppet] - https://gerrit.wikimedia.org/r/984630 (https://phabricator.wikimedia.org/T353735) (owner: Herron)
[04:41:25] (CR) Andrea Denisse: [C: +2] admin: add dreamyjazz to deployment [puppet] - https://gerrit.wikimedia.org/r/984630 (https://phabricator.wikimedia.org/T353735) (owner: Herron)
[04:49:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL -
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:50:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 1.316 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:17:23] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (andrea.denisse) Hi @Dreamy_Jazz , access to deployment granted. Could you please confirm that you can access?
[05:18:27] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (andrea.denisse) a:andrea.denisse
[05:18:31] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (andrea.denisse) Open→In progress Hi @XenoRyet could you please review/approve the access request as manager?
[05:39:14] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:10:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:15:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:21:06] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:57:33] (PS1) Marostegui: installserver: Do not reimage db1245 [puppet] - https://gerrit.wikimedia.org/r/987891
[06:59:11] (PS1) Muehlenhoff: Remove expiry date for Tiziano Piccardi [puppet] - https://gerrit.wikimedia.org/r/987892
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240105T0700)
[07:01:06] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:01:49] (CR) Marostegui: [C: +2] installserver: Do not reimage db1245 [puppet] - https://gerrit.wikimedia.org/r/987891 (owner: Marostegui)
[07:02:59] (CR) Muehlenhoff: "This misses the updates, see comment inline" [puppet] - https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) (owner: Bking)
[07:03:16] (CR) Muehlenhoff: [C: +2] Remove expiry date for Tiziano Piccardi [puppet] - https://gerrit.wikimedia.org/r/987892 (owner: Muehlenhoff)
[07:18:32] (KubernetesRsyslogDown) firing: (6) rsyslog on mw1378:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:19:03] SRE, Traffic-Icebox, HTTPS, Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (Diskdance)
[07:21:38] SRE,
SRE-Access-Requests: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (ArthurTaylor) Hi @andrea.denisse , Thanks for the info. I don't know what I would need to do to pass the `check_user` check. But if there's anything I can do to help un-stick that,...
[07:26:29] (PS1) Marostegui: report_users.sh: Improvements [software] - https://gerrit.wikimedia.org/r/987894
[07:26:36] (CR) CI reject: [V: -1] report_users.sh: Improvements [software] - https://gerrit.wikimedia.org/r/987894 (owner: Marostegui)
[07:28:16] (PS1) Marostegui: report_users.sh: Improvements [software] - https://gerrit.wikimedia.org/r/987895
[07:29:04] (Abandoned) Marostegui: report_users.sh: Improvements [software] - https://gerrit.wikimedia.org/r/987894 (owner: Marostegui)
[07:29:16] (CR) Marostegui: [C: +2] report_users.sh: Improvements [software] - https://gerrit.wikimedia.org/r/987895 (owner: Marostegui)
[07:29:49] (Merged) jenkins-bot: report_users.sh: Improvements [software] - https://gerrit.wikimedia.org/r/987895 (owner: Marostegui)
[07:34:07] (CR) Muehlenhoff: [C: +2] Switch role::test to nftables [puppet] - https://gerrit.wikimedia.org/r/987791 (owner: Muehlenhoff)
[07:44:08] SRE, Traffic-Icebox, HTTPS, Wikimedia-Performance-recommendation: Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Diskdance)
[07:46:10] (CR) ArielGlenn: add foundationwiki to the list of central auth login wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: ArielGlenn)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240105T0800)
[08:01:28] !log installing 6.1.69 kernels on Bookworm hosts
[08:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:45] (PS1) Muehlenhoff: Switch netmon to nftables [puppet] - https://gerrit.wikimedia.org/r/987945
[08:39:10] SRE, Infrastructure-Foundations, netops: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 (ayounsi) Open→Resolved All good.
[08:41:01] SRE, Infrastructure-Foundations, netops: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ayounsi) a:ayounsi
[08:46:56] SRE, Infrastructure-Foundations: Upgrade the IDP servers to Bookworm - https://phabricator.wikimedia.org/T354405 (MoritzMuehlenhoff)
[08:47:07] SRE, Infrastructure-Foundations: Upgrade the IDP servers to Bookworm - https://phabricator.wikimedia.org/T354405 (MoritzMuehlenhoff) p:Triage→Medium
[08:49:36] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987945 (owner: Muehlenhoff)
[09:03:07] SRE, Infrastructure-Foundations: Upgrade the IDP servers to Bookworm - https://phabricator.wikimedia.org/T354405 (MoritzMuehlenhoff) Java 17 should be supported starting with 6.5 already: https://apereo.github.io/cas/6.5.x/release_notes/RC1.html : **JDK 17 Compatibility** CAS is able to build and run suc...
[09:15:55] <_joe_> !log upgrading conftool across the fleet
[09:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:45] (PS2) Giuseppe Lavagetto: Release 4.0.0 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/977437
[09:20:34] (PS1) Ayounsi: Revert "Disable Telemetry on eqsin switches" [homer/public] - https://gerrit.wikimedia.org/r/987741 (https://phabricator.wikimedia.org/T332395)
[09:20:52] (CR) Giuseppe Lavagetto: [C: +2] Release 4.0.0 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/977437 (owner: Giuseppe Lavagetto)
[09:23:07] (Merged) jenkins-bot: Release 4.0.0 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/977437 (owner: Giuseppe Lavagetto)
[09:24:07] (PS3) Jelto: miscweb: add design.wikimedia.org services [deployment-charts] - https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791)
[09:26:16] !log installing 5.10.205 kernels on Bullseye hosts
[09:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:38] (PS4) Jelto: miscweb: add design.wikimedia.org services [deployment-charts] - https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791)
[09:28:16] (PS5) Jelto: miscweb: add design.wikimedia.org services [deployment-charts] - https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791)
[09:28:45] (CR) Filippo Giunchedi: [C: +1] Switch netmon to nftables [puppet] - https://gerrit.wikimedia.org/r/987945 (owner: Muehlenhoff)
[09:39:15] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:41:38] (PS6) Jelto: miscweb: add design.wikimedia.org services [deployment-charts] - https://gerrit.wikimedia.org/r/987758
(https://phabricator.wikimedia.org/T350791)
[09:42:13] (CR) Ladsgroup: [C: +1] "I'll deploy this on Monday." [mediawiki-config] - https://gerrit.wikimedia.org/r/987861 (owner: VolkerE)
[10:02:48] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ayounsi) Latest Junos recommended has been copied to /var/tmp/ Next steps: downtime the site and proceed with the upgrade : https://wikitech.wikimedia.org/wiki/Juniper_switch...
[10:06:31] (CR) JMeybohm: [C: +1] miscweb: add design.wikimedia.org services [deployment-charts] - https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791) (owner: Jelto)
[10:25:18] SRE, conftool: conftool no longer automatically !logs changes - https://phabricator.wikimedia.org/T354209 (taavi) Open→Resolved
[10:39:37] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) p:Medium→High This can also prevent schema changes to be fully applied to all the replicas.
[10:41:40] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (dcaro) Currently the only workaround I've found (as we don't use elasticsearch itself) is to install in the local v...
[10:43:43] (PS1) Alexandros Kosiaris: services_proxy: Support tracing [puppet] - https://gerrit.wikimedia.org/r/987954 (https://phabricator.wikimedia.org/T351566)
[10:48:45] (CR) CI reject: [V: -1] services_proxy: Support tracing [puppet] - https://gerrit.wikimedia.org/r/987954 (https://phabricator.wikimedia.org/T351566) (owner: Alexandros Kosiaris)
[10:49:37] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (dcaro) Note that as of jan 2024, you will need also to workaround that python-kafka<=2.0.2 does not work with pytho...
[10:53:53] SRE-tools, Infrastructure-Foundations, Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410 (dcaro)
[10:55:44] SRE-tools, Infrastructure-Foundations, Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410 (dcaro)
[10:55:58] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Revert dbstore migration to puppet7 - https://phabricator.wikimedia.org/T354411 (Marostegui)
[10:56:23] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (Marostegui)
[11:01:15] (PS2) Alexandros Kosiaris: services_proxy: Support tracing [puppet] - https://gerrit.wikimedia.org/r/987954 (https://phabricator.wikimedia.org/T351566)
[11:04:24] SRE, SRE-Access-Requests: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (Dreamy_Jazz) >>! In T353735#9437021, @andrea.denisse wrote: > Hi @Dreamy_Jazz , access to deployment granted. > > Could you please confirm that you can access?
Hi. I seem to have acces...
[11:04:45] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987954 (https://phabricator.wikimedia.org/T351566) (owner: Alexandros Kosiaris)
[11:10:55] (PS1) Kamila Součková: TEMPORARY role for debugging T354413 for mw1377 [puppet] - https://gerrit.wikimedia.org/r/987958 (https://phabricator.wikimedia.org/T354413)
[11:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[11:12:48] (CR) Kamila Součková: [C: +2] TEMPORARY role for debugging T354413 for mw1377 [puppet] - https://gerrit.wikimedia.org/r/987958 (https://phabricator.wikimedia.org/T354413) (owner: Kamila Součková)
[11:14:19] (CR) Alexandros Kosiaris: [C: +2] "PCC LGTM, merging" [puppet] - https://gerrit.wikimedia.org/r/987954 (https://phabricator.wikimedia.org/T351566) (owner: Alexandros Kosiaris)
[11:15:20] (PS1) Kamila Součková: TEMPORARY changes for debugging T354413 [puppet] - https://gerrit.wikimedia.org/r/987960
[11:15:50] (CR) CI reject: [V: -1] TEMPORARY changes for debugging T354413 [puppet] - https://gerrit.wikimedia.org/r/987960 (owner: Kamila Součková)
[11:18:24] (PS1) Kamila Součková: Revert "TEMPORARY role for debugging T354413 for mw1377" [puppet] - https://gerrit.wikimedia.org/r/987743
[11:18:32] (KubernetesRsyslogDown) firing: (6) rsyslog on mw1378:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:19:32] (CR) Kamila Součková: [C: +2] Revert "TEMPORARY role for debugging T354413 for mw1377" [puppet] - https://gerrit.wikimedia.org/r/987743 (owner: Kamila Součková)
[11:19:35] (PS1) Muehlenhoff: deployment servers: Set safe.directory in general [puppet]
- https://gerrit.wikimedia.org/r/987961 (https://phabricator.wikimedia.org/T335354)
[11:26:13] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987961 (https://phabricator.wikimedia.org/T335354) (owner: Muehlenhoff)
[11:29:35] SRE, Ganeti, Infrastructure-Foundations, netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) Next steps to create a production grade routed cluster: # {T353935} # Assign a private and optionally public IPv4 and v6 range for codfw # Add a Hiera key `pro...
[11:34:15] SRE, Ganeti, Infrastructure-Foundations, netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (MoritzMuehlenhoff) Let's start the routed ganeti setup directly on Bookworm (IOW reimage ganeti2033/2024 after the move); the regular Ganeti clusters are still on Bullse...
[11:44:57] (PS4) Alexandros Kosiaris: Switch canaries to 1% OpenTelemetry sampling [puppet] - https://gerrit.wikimedia.org/r/984814 (https://phabricator.wikimedia.org/T351566)
[11:45:46] SRE, Infrastructure-Foundations, Puppet-Core: [cumin] urllib >= 2 fails with the new internal certificates - https://phabricator.wikimedia.org/T354415 (dcaro)
[11:49:21] SRE, Infrastructure-Foundations, Puppet-Core: [cumin] urllib >= 2 fails to disable warning SubjectAltNameWarning as exception is not there anymore - https://phabricator.wikimedia.org/T354415 (dcaro)
[11:49:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reboot-single for host mw1379.eqiad.wmnet
[11:50:20] SRE, Infrastructure-Foundations, Puppet-Core: [cumin] urllib >= 2 fails to disable warning SubjectAltNameWarning as exception is not there anymore - https://phabricator.wikimedia.org/T354415 (dcaro) We have that usually configured in the spicerack config: ` 9 puppetdb:...
[11:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[11:56:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw1379.eqiad.wmnet
[11:58:17] (KubernetesRsyslogDown) firing: (6) rsyslog on mw1378:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:01:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[12:06:18] PROBLEM - Host mw1379 is DOWN: PING CRITICAL - Packet loss = 100%
[12:08:24] RECOVERY - Host mw1379 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[12:10:13] (KubernetesCalicoDown) firing: mw1379.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1379.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:15:12] (KubernetesCalicoDown) resolved: mw1379.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1379.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:21:02] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:38] PROBLEM - Host mw1379 is DOWN: PING CRITICAL - Packet loss = 100%
[12:22:58] RECOVERY - Host mw1379 is UP: PING
OK - Packet loss = 0%, RTA = 0.64 ms
[12:42:53] (PS1) Urbanecm: beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225)
[12:42:55] (PS1) Urbanecm: beta: Enable conditional defaults for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225)
[12:43:16] (CR) Urbanecm: [C: -2] "not now" [mediawiki-config] - https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm)
[12:43:20] (CR) Urbanecm: [C: -2] "not now" [mediawiki-config] - https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm)
[12:43:35] (CR) CI reject: [V: -1] beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm)
[12:44:22] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:59] SRE, Ganeti, Infrastructure-Foundations, netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) Will do, thanks. For prefix allocation I'm suggesting the following, let me know what you think (especially @cmooney ! ) * eqiad * 10.64.24.0/23 - private1-v...
[12:54:02] (CR) Jelto: [C: +2] miscweb: add design.wikimedia.org services [deployment-charts] - https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791) (owner: Jelto)
[12:55:44] (Merged) jenkins-bot: miscweb: add design.wikimedia.org services [deployment-charts] - https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791) (owner: Jelto)
[13:22:53] SRE, Infrastructure-Foundations, Mail, Znuny, collaboration-services: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (LSobanski) a:Arnoldokoth→None
[13:23:25] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[13:23:59] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[13:34:12] (PS1) Jelto: miscweb: remove ingress match in design-landing-page [deployment-charts] - https://gerrit.wikimedia.org/r/987966 (https://phabricator.wikimedia.org/T350791)
[13:34:38] (PS1) Santiago Faci: deploying a new edit-analytics version to staging environment [deployment-charts] - https://gerrit.wikimedia.org/r/987967 (https://phabricator.wikimedia.org/T354074)
[13:37:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:38:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:38:48] (CR) Jelto: [C: +2] "per discussion in IRC I'll try this config in wikikube staging" [deployment-charts] - https://gerrit.wikimedia.org/r/987966 (https://phabricator.wikimedia.org/T350791) (owner: Jelto)
[13:39:15] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 -
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:39:43] (Merged) jenkins-bot: miscweb: remove ingress match in design-landing-page [deployment-charts] - https://gerrit.wikimedia.org/r/987966 (https://phabricator.wikimedia.org/T350791) (owner: Jelto)
[13:41:57] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[13:42:19] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[13:50:08] (PS1) Jelto: miscweb: use design.wikimedia.org instead of design.wikipedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/987969 (https://phabricator.wikimedia.org/T350791)
[14:09:23] (CR) JMeybohm: [C: +1] "🙈" [deployment-charts] - https://gerrit.wikimedia.org/r/987969 (https://phabricator.wikimedia.org/T350791) (owner: Jelto)
[14:10:44] (CR) Jelto: [C: +2] miscweb: use design.wikimedia.org instead of design.wikipedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/987969 (https://phabricator.wikimedia.org/T350791) (owner: Jelto)
[14:10:51] SRE, SRE-tools, DBA, Data-Platform-SRE, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (BTullis) a:BTullis I've got no problem with this. I think that I can run the **rollback** steps from T349619.
[14:11:35] (03CR) 10Dzahn: [C: 03+1] miscweb: use design.wikimedia.org instead of design.wikipedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/987969 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:11:45] (03Merged) 10jenkins-bot: miscweb: use design.wikimedia.org instead of design.wikipedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/987969 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:12:03] (03CR) 10Milimetric: [C: 03+2] deploying a new edit-analytics version to staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/987967 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [14:12:05] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (10BTullis) [14:12:14] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (10Marostegui) I am not sure if that'll bring us everything back or we'll need to do something with the mariadb certificates too cc @ABran-WMF [14:12:54] (03Merged) 10jenkins-bot: deploying a new edit-analytics version to staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/987967 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [14:14:43] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:14:57] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:19:59] 10SRE, 10Thumbor, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint (Sprint 4 (8th Jan.‘24 - 19th Jan.'24)): Error creating thumbnail: Unknown option --no-external-files - https://phabricator.wikimedia.org/T354407 (10kostajh) Adding #sre in case folks there have an idea of what mig... 
[14:32:29] (03PS1) 10Peter Fischer: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987973 (https://phabricator.wikimedia.org/T353460) [14:32:41] (03PS1) 10Jelto: miscweb: add httproutes with match and route to landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/987974 (https://phabricator.wikimedia.org/T350791) [14:32:51] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987973 (https://phabricator.wikimedia.org/T353460) (owner: 10Peter Fischer) [14:34:19] (03Merged) 10jenkins-bot: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987973 (https://phabricator.wikimedia.org/T353460) (owner: 10Peter Fischer) [14:37:11] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:26] (03CR) 10Ottomata: webrequest varnishkafka - Add to X-Analytics the Sec-Purpose HTTP header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [14:37:59] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:38:02] (03CR) 10Ottomata: [C: 04-1] "-1 until we resolve Adam's comment. (if i'm not around and it is resolved please remove my -1)." 
[puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [14:38:36] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:41:10] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:41:25] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:42:50] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:43:39] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:44:58] (03CR) 10JMeybohm: [C: 03+1] miscweb: add httproutes with match and route to landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/987974 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:45:08] !log milimetric@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [14:45:29] (03CR) 10Eevans: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/987718 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [14:45:35] !log milimetric@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [14:46:32] (03CR) 10Jelto: [C: 03+2] miscweb: add httproutes with match and route to landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/987974 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:47:34] (03Merged) 10jenkins-bot: miscweb: add httproutes with match and route to landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/987974 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:48:43] 10sre-alert-triage, 10cloud-services-team: Alert in need of triage: ExporterUnavailable - https://phabricator.wikimedia.org/T354421 (10LSobanski) [14:50:23] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:50:34] !log jelto@deploy2002 
helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:53:32] (03PS1) 10Kamila Součková: mw-api-int: replicas x1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) [14:54:20] (03CR) 10Kamila Součková: [C: 04-1] "do not merge before adding capacity to eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/987976 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [14:57:11] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:43] (03PS12) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [15:00:22] (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:00:25] (03CR) 10CI reject: [V: 04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:14:29] (03PS1) 10Jelto: miscweb: fix existingGatewayName and remove routeHosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/987979 (https://phabricator.wikimedia.org/T350791) [15:26:46] (03CR) 10JMeybohm: [C: 03+1] "getting there 😄" [deployment-charts] - 10https://gerrit.wikimedia.org/r/987979 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [15:27:23] (03CR) 10Jelto: [C: 03+2] miscweb: fix existingGatewayName and remove routeHosts [deployment-charts] - 
10https://gerrit.wikimedia.org/r/987979 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [15:28:39] (03Merged) 10jenkins-bot: miscweb: fix existingGatewayName and remove routeHosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/987979 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [15:30:07] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:31:06] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:31:45] (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [15:32:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/980824 (https://phabricator.wikimedia.org/T308002) (owner: 10Ayounsi) [15:32:41] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox] - 10https://gerrit.wikimedia.org/r/980908 (https://phabricator.wikimedia.org/T310717) (owner: 10Ayounsi) [15:32:52] (03CR) 10Jbond: [C: 03+1] Bump standards version [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981293 (owner: 10Muehlenhoff) [15:33:02] (03PS1) 10Kosta Harlan: mediamoderation: Switch to using all.dblist [puppet] - 10https://gerrit.wikimedia.org/r/987983 (https://phabricator.wikimedia.org/T353703) [15:33:06] (03CR) 10Jbond: [C: 03+1] Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [15:35:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:40:10] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:40:29] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:40:39] !log pfischer@deploy2002 helmfile [staging] START 
helmfile.d/services/cirrus-streaming-updater: apply [15:40:48] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:41:43] (03CR) 10Jbond: [C: 04-1] "-1: This is used by CI to ensure that rspec tests can lookup [mocked] hiera data in the private repo. See the file below which is the hie" [puppet] - 10https://gerrit.wikimedia.org/r/984871 (owner: 10JHathaway) [15:48:25] (03CR) 10Jbond: "The module hasn't been updated in 8 years so i think its abandoned. Also the only code here is the quota.rb file which is a very rough ex" [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [15:49:43] (03CR) 10Jbond: [C: 04-1] "-1: is for not using g10k" [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [15:53:17] (KubernetesRsyslogDown) firing: (5) rsyslog on mw1378:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:53:44] PROBLEM - Host mw1383 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:24] RECOVERY - Host mw1383 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [15:59:00] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10jbond) >>! In T352974#9392688, @ABran-WMF wrote: > it appears that most of our hosts are still using `/etc/ssl/certs/Puppet_Internal_CA.pem` and... 
[15:59:12] (KubernetesCalicoDown) firing: mw1383.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1383.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:04:12] (KubernetesCalicoDown) resolved: mw1383.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1383.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:05:10] (03PS1) 10JMeybohm: miscweb: Fix destination host for design-strategy route [deployment-charts] - 10https://gerrit.wikimedia.org/r/988006 (https://phabricator.wikimedia.org/T350791) [16:05:12] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10andrea.denisse) 05In progress→03Resolved [16:07:16] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (10andrea.denisse) a:03andrea.denisse [16:08:13] (03PS2) 10JMeybohm: miscweb: Fix destination host for design-strategy route [deployment-charts] - 10https://gerrit.wikimedia.org/r/988006 (https://phabricator.wikimedia.org/T350791) [16:13:22] (03CR) 10JHathaway: "The module is a bit dated, for example using the params.pp pattern for defaults, but otherwise the code is pretty short and succinct. I th" [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [16:16:02] (03CR) 10Jelto: [C: 03+1] "thanks for spotting this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/988006 (https://phabricator.wikimedia.org/T350791) (owner: 10JMeybohm) [16:16:14] (03CR) 10Jelto: [C: 03+2] miscweb: Fix destination host for design-strategy route [deployment-charts] - 10https://gerrit.wikimedia.org/r/988006 (https://phabricator.wikimedia.org/T350791) (owner: 10JMeybohm) [16:17:19] (03CR) 10JHathaway: puppet: add quota module to vendor_modules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [16:17:25] (03Merged) 10jenkins-bot: miscweb: Fix destination host for design-strategy route [deployment-charts] - 10https://gerrit.wikimedia.org/r/988006 (https://phabricator.wikimedia.org/T350791) (owner: 10JMeybohm) [16:18:04] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [16:18:17] (KubernetesRsyslogDown) firing: (4) rsyslog on mw1378:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:19:11] (03PS2) 10Bking: aptrepo: add Elastic-related components to bookworm repo [puppet] - 10https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) [16:19:26] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:20:44] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [16:25:49] (03PS5) 10Cwhite: elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [16:27:43] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) >>! In T269328#9418126, @ayounsi wrote: > @Eevans reviving this years old thread now that Cassandra has been upgraded to 4.x since a few months. Would it be possible to look into not usi... 
[16:29:48] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:30:14] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:31:15] (03CR) 10Jbond: [C: 04-1] puppet: add quota module to vendor_modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [16:32:27] (03CR) 10Jbond: [C: 04-1] "FTR the -1 is just for the Puppetfile/g10k comment" [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [16:32:29] (03CR) 10CI reject: [V: 04-1] elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [16:32:34] (HelmReleaseBadStatus) firing: (2) Helm release miscweb/design-blog on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:38:32] (03CR) 10Bking: aptrepo: add Elastic-related components to bookworm repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) (owner: 10Bking) [16:40:34] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:42:34] (HelmReleaseBadStatus) resolved: (2) Helm release miscweb/design-blog on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:42:57] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:43:13] !log 
jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:48:26] (03CR) 10JHathaway: rake: remove cloning of private repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/984871 (owner: 10JHathaway) [16:48:55] (03Abandoned) 10JHathaway: rake: remove cloning of private repo [puppet] - 10https://gerrit.wikimedia.org/r/984871 (owner: 10JHathaway) [16:49:16] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [16:49:32] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:52:26] (03PS1) 10JHathaway: rakefile: improve docs [puppet] - 10https://gerrit.wikimedia.org/r/988015 [16:54:41] (03CR) 10JHathaway: puppet: add quota module to vendor_modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [16:55:09] (03CR) 10JHathaway: [C: 03+2] rakefile: improve docs [puppet] - 10https://gerrit.wikimedia.org/r/988015 (owner: 10JHathaway) [17:01:23] (03Abandoned) 10Kamila Součková: TEMPORARY changes for debugging T354413 [puppet] - 10https://gerrit.wikimedia.org/r/987960 (owner: 10Kamila Součková) [17:06:45] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/986838 [17:07:02] (03CR) 10Cwhite: Increase socat 6to4 buffer size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987879 (https://phabricator.wikimedia.org/T353220) (owner: 10Tim Starling) [17:07:44] (03CR) 10Muehlenhoff: aptrepo: add Elastic-related components to bookworm repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) (owner: 10Bking) [17:08:00] (03PS2) 10Cwhite: Increase socat 6to4 buffer size [puppet] - 10https://gerrit.wikimedia.org/r/987879 (https://phabricator.wikimedia.org/T353220) (owner: 10Tim Starling) [17:08:25] (03PS1) 10Peter Fischer: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/988017 
(https://phabricator.wikimedia.org/T353460) [17:09:54] (03CR) 10Cwhite: [C: 03+2] "Lets try it!" [puppet] - 10https://gerrit.wikimedia.org/r/987879 (https://phabricator.wikimedia.org/T353220) (owner: 10Tim Starling) [17:11:32] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/988017 (https://phabricator.wikimedia.org/T353460) (owner: 10Peter Fischer) [17:12:20] (03Merged) 10jenkins-bot: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/988017 (https://phabricator.wikimedia.org/T353460) (owner: 10Peter Fischer) [17:13:56] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:14:44] (03CR) 10Jbond: rake: remove cloning of private repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/984871 (owner: 10JHathaway) [17:14:47] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:25:21] (03CR) 10JHathaway: rake: remove cloning of private repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/984871 (owner: 10JHathaway) [17:35:15] (03PS1) 10JHathaway: puppet-lint-wmf_styleguide-check: bump to 1.1.4 [puppet] - 10https://gerrit.wikimedia.org/r/988021 [17:39:15] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:41:01] (03CR) 10JHathaway: [C: 03+2] puppet-lint-wmf_styleguide-check: bump to 1.1.4 [puppet] - 10https://gerrit.wikimedia.org/r/988021 (owner: 10JHathaway) [17:45:08] (03PS1) 10FNegri: wmcs_wheel_of_misfortune: exclude uid<=1000 [puppet] - 10https://gerrit.wikimedia.org/r/988024 (https://phabricator.wikimedia.org/T354430) [17:49:08] (03PS3) 10Bking: aptrepo: add Elastic-related components to bookworm repo [puppet] - 
10https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) [17:51:02] (03CR) 10Bking: aptrepo: add Elastic-related components to bookworm repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) (owner: 10Bking) [17:51:21] (03PS2) 10FNegri: wmcs_wheel_of_misfortune: exclude uid<1000 [puppet] - 10https://gerrit.wikimedia.org/r/988024 (https://phabricator.wikimedia.org/T354430) [18:30:25] (03CR) 10BryanDavis: [C: 03+1] "Add another tally to the "problems caused by poettering deciding all people who came before him are wrong" board. :/" [puppet] - 10https://gerrit.wikimedia.org/r/988024 (https://phabricator.wikimedia.org/T354430) (owner: 10FNegri) [18:47:05] (03PS1) 10Andrew Bogott: mwopenstackclients.py: remove a use of project_name [puppet] - 10https://gerrit.wikimedia.org/r/988050 (https://phabricator.wikimedia.org/T343158) [18:47:07] (03PS1) 10Andrew Bogott: cloud-vps puppet encapi: use project_id instead of project_name for keystone [puppet] - 10https://gerrit.wikimedia.org/r/988051 (https://phabricator.wikimedia.org/T343158) [18:47:09] (03PS1) 10Andrew Bogott: Keystone: remove hack ensuring that project_id == project_name [puppet] - 10https://gerrit.wikimedia.org/r/988052 (https://phabricator.wikimedia.org/T343158) [19:04:38] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [19:04:40] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:41] !log vrts1001 - sudo systemctl start clamav-daemon [19:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:44] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd 
https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [19:07:46] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:58] !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase2034.codfw.wmnet [19:29:14] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase2034.codfw.wmnet [19:29:15] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2034.codfw.wmnet [19:29:50] (03PS1) 10Dzahn: cloud.yaml: set graphite_host to 'localhost' [puppet] - 10https://gerrit.wikimedia.org/r/988058 [19:31:03] (03PS5) 10Eevans: restbase: add missing keys & certs, remove obsolete [labs/private] - 10https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468) [19:32:21] (03CR) 10Eevans: [V: 03+2 C: 03+2] restbase: add missing keys & certs, remove obsolete [labs/private] - 10https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [19:52:22] (03PS2) 10Dzahn: cloud.yaml: set graphite_host to 'graphite.invalid' [puppet] - 10https://gerrit.wikimedia.org/r/988058 [19:53:46] (03PS3) 10Dzahn: cloud.yaml: set graphite_host to 'graphite.wmcloud.invalid' [puppet] - 10https://gerrit.wikimedia.org/r/988058 [20:08:55] (03CR) 10Krinkle: [C: 03+1] add foundationwiki to the list of central auth login wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn) [20:18:17] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:23:45] (03PS1) 10JHathaway: cloud: disable statsite in cloud [puppet] - 10https://gerrit.wikimedia.org/r/988076 [20:24:13] (03CR) 10JHathaway: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/988076 (owner: 10JHathaway) [20:27:11] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:28:59] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:41:34] (03PS1) 10Dzahn: clamav: replace systemd unit file, enable restart on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988081 [20:42:43] (03CR) 10CI reject: [V: 04-1] clamav: replace systemd unit file, enable restart on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988081 (owner: 10Dzahn) [20:43:12] (03PS2) 10Dzahn: clamav: replace systemd unit file, enable restart on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988081 [20:43:34] (03PS3) 10Dzahn: clamav: replace systemd unit file, enable restart on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988081 [20:44:43] (03CR) 10CI reject: [V: 04-1] clamav: replace systemd unit file, enable restart on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988081 (owner: 10Dzahn) [20:45:46] (03CR) 10Dzahn: "happened one more time today. how about this approach.. 
and let's just have systemd restart it on failure:" [puppet] - 10https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) (owner: 10AOkoth) [20:48:09] (03CR) 10Dzahn: [C: 03+1] cloud: disable statsite in cloud [puppet] - 10https://gerrit.wikimedia.org/r/988076 (owner: 10JHathaway) [20:48:41] (03CR) 10Dzahn: [C: 03+1] "replaces https://gerrit.wikimedia.org/r/c/operations/puppet/+/988058 which is where this came from" [puppet] - 10https://gerrit.wikimedia.org/r/988076 (owner: 10JHathaway) [20:48:52] (03Abandoned) 10Dzahn: cloud.yaml: set graphite_host to 'graphite.wmcloud.invalid' [puppet] - 10https://gerrit.wikimedia.org/r/988058 (owner: 10Dzahn) [20:50:08] (03PS4) 10Dzahn: clamav: replace systemd unit file, enable restart on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988081 [20:50:35] (03PS5) 10Dzahn: clamav: replace systemd unit file, enable restart on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988081 [20:52:19] (03CR) 10JHathaway: [C: 03+2] cloud: disable statsite in cloud [puppet] - 10https://gerrit.wikimedia.org/r/988076 (owner: 10JHathaway) [20:54:45] (03PS1) 10Dzahn: secret: delete fake keys for hosts in Tampa(!) 
[labs/private] - 10https://gerrit.wikimedia.org/r/988084 [20:57:24] (03PS1) 10Dzahn: secret: remove passwords and fake key for ganglia [labs/private] - 10https://gerrit.wikimedia.org/r/988085 [21:14:13] (03PS1) 10Jdlrobson: Fix Special:ExternalGuidance [extensions/ExternalGuidance] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987994 (https://phabricator.wikimedia.org/T354404) [21:15:30] (03CR) 10Jdlrobson: "Test url: https://en.m.wikipedia.beta.wmflabs.org/wiki/Special:ExternalGuidance?from=en&to=fr&page=Foo&language=es&service=Google#/create-" [extensions/ExternalGuidance] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987994 (https://phabricator.wikimedia.org/T354404) (owner: 10Jdlrobson) [21:21:42] (03CR) 10Dzahn: [C: 04-1] "should use a dropin instead of overwriting the file" [puppet] - 10https://gerrit.wikimedia.org/r/988081 (owner: 10Dzahn) [21:38:46] (03PS6) 10Dzahn: clamav: add systemd override, enable restart on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988081 [21:39:16] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:39:46] (03PS7) 10Dzahn: clamav: add systemd override, enable restart on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988081 [21:41:12] (03PS2) 10Dzahn: secret: delete fake keys for hosts in Tampa(!) [labs/private] - 10https://gerrit.wikimedia.org/r/988084 [21:43:01] (03PS2) 10Dzahn: secret: remove passwords and fake key for ganglia [labs/private] - 10https://gerrit.wikimedia.org/r/988085 (https://phabricator.wikimedia.org/T253555) [21:46:56] (03PS3) 10Dzahn: secret: remove passwords and fake key for ganglia [labs/private] - 10https://gerrit.wikimedia.org/r/988085 (https://phabricator.wikimedia.org/T253555) [21:52:59] (03PS3) 10Dzahn: secret: delete fake keys for hosts in Tampa(!) 
[labs/private] - 10https://gerrit.wikimedia.org/r/988084 (https://phabricator.wikimedia.org/T84536) [22:05:06] (03PS1) 10Dzahn: phabricator: move logmail to a separate profile to simplify main profile [puppet] - 10https://gerrit.wikimedia.org/r/988106 [22:06:18] (03CR) 10CI reject: [V: 04-1] phabricator: move logmail to a separate profile to simplify main profile [puppet] - 10https://gerrit.wikimedia.org/r/988106 (owner: 10Dzahn) [22:15:01] (03CR) 10JHathaway: [C: 03+1] secret: delete fake keys for hosts in Tampa(!) [labs/private] - 10https://gerrit.wikimedia.org/r/988084 (https://phabricator.wikimedia.org/T84536) (owner: 10Dzahn) [22:18:14] (03PS1) 10Dzahn: phabricator: move data syncing related code to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/988107 (https://phabricator.wikimedia.org/T354221) [22:19:24] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10CodeReviewBot) brett closed https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/3 Release 1.15.14 [22:20:16] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10CodeReviewBot) brett reopened https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/3 Release 1.15.14 [22:20:45] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10CodeReviewBot) brett closed https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/3 Release 1.15.14 [22:27:49] (03PS1) 10Sharvaniharan: New stream config for mobileapps Places feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988108 [22:29:16] (03PS2) 10Sharvaniharan: New stream config for mobileapps Places feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988108 (https://phabricator.wikimedia.org/T351165) [22:44:09] (03CR) 10Cwhite: [C: 03+1] prometheus: validate check Prometheus instance [puppet] 
- 10https://gerrit.wikimedia.org/r/987789 (owner: 10Filippo Giunchedi) [22:44:20] (03CR) 10Dreamy Jazz: [C: 03+1] mediamoderation: Switch to using all.dblist [puppet] - 10https://gerrit.wikimedia.org/r/987983 (https://phabricator.wikimedia.org/T353703) (owner: 10Kosta Harlan) [23:05:40] (03PS1) 10Thcipriani: Revert "Add a banner for the 2023 developer survey" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987995 [23:09:18] (03PS1) 10Dzahn: phabricator: use quickdatacopy for automatic home dir sync [puppet] - 10https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) [23:09:25] (03CR) 10Thcipriani: [C: 03+2] Revert "Add a banner for the 2023 developer survey" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987995 (owner: 10Thcipriani) [23:10:25] (03Merged) 10jenkins-bot: Revert "Add a banner for the 2023 developer survey" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987995 (owner: 10Thcipriani) [23:10:51] (03CR) 10CI reject: [V: 04-1] phabricator: use quickdatacopy for automatic home dir sync [puppet] - 10https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [23:14:39] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - https://gerrit.wikimedia.org/r/987945 (owner: Muehlenhoff) [23:16:53] (PS1) Dzahn: phabricator: avoid duplicate list of server names in Hiera [puppet] - https://gerrit.wikimedia.org/r/988112 (https://phabricator.wikimedia.org/T354221) [23:23:51] (PS1) Dzahn: phabricator: add key for secondary server and create combined list [puppet] - https://gerrit.wikimedia.org/r/988113 (https://phabricator.wikimedia.org/T354221) [23:25:39] !log deploying gerrit to remove survey banner https://gerrit.wikimedia.org/r/987995 (no downtime needed) [23:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:10] (PS1) Htriedman: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - https://gerrit.wikimedia.org/r/988114 [23:31:49] !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@de3a994]: Removing survey banner [[gerrit:987995]] (gerrit-replicas only this deploy) [23:31:56] !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@de3a994]: Removing survey banner [[gerrit:987995]] (gerrit-replicas only this deploy) (duration: 00m 06s) [23:43:02] (PS2) Dzahn: phabricator: move logmail to a separate profile to simplify main profile [puppet] - https://gerrit.wikimedia.org/r/988106 [23:48:31] (PS1) Dzahn: phabricator: move prometheus smtp check to monitoring class [puppet] - https://gerrit.wikimedia.org/r/988116 [23:49:26] !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@de3a994]: Removing survey banner [[gerrit:987995]] (gerrit.wikimedia.org only this deploy) [23:49:34] !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@de3a994]: Removing survey banner [[gerrit:987995]] (gerrit.wikimedia.org only this deploy) (duration: 00m 08s) [23:50:20] (CR) Dzahn: "more "active_server" stuff here. 
and the question if we should monitor smtp on both or just active" [puppet] - https://gerrit.wikimedia.org/r/988116 (owner: Dzahn) [23:52:07] (PS3) Dzahn: phabricator: move logmail to a separate profile to simplify main profile [puppet] - https://gerrit.wikimedia.org/r/988106