[00:18:31] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[00:34:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[00:38:18] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/983235
[00:38:24] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/983235 (owner: TrainBranchBot)
[00:52:23] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[00:56:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:57:23] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:57:53] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:57:58]
(Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/983235 (owner: TrainBranchBot)
[01:08:29] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[02:00:45] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:59] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[02:18:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[02:28:11] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:33:11] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:36:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad
mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:36:54] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:44:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:04:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:06:54] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable
- https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:12:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:17:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:38:31] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:48:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:13:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:15:07]
PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:31] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[05:04:36] (CR) Anzx: "To avoid breaking existing links, the old namespace names could be added as wgNamespaceAliases." [mediawiki-config] - https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) (owner: Strainu)
[05:08:29] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:37:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:37:35] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:37:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:38:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.379 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:38:55] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:38:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51008 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:09:06] (PS1) Marostegui: installserver: Do not format db1240 [puppet] - https://gerrit.wikimedia.org/r/983564
[06:11:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:13:41] (CR) Marostegui: [C: +2] installserver: Do not format db1240 [puppet] - https://gerrit.wikimedia.org/r/983564 (owner: Marostegui)
[06:18:49] (PS3) KartikMistry: Update MinT to 2023-12-12-065316-production [deployment-charts] - https://gerrit.wikimedia.org/r/982645
[06:26:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:27:13] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:27:25] (PS1) Marostegui: db2131: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/983565
[06:28:12] (CR)
Marostegui: [C: +2] db2131: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/983565 (owner: Marostegui)
[06:28:14] SRE, Maps, Traffic: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (Nicolas_Raoul) Any blocker for approval? We have already implemented the code in the app (switched from Mapbox to osmdroid library), but are waiting for your approval here bef...
[06:36:24] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:00:39] (CR) Marostegui: Add compare tables periodic job (2 comments) [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: Ladsgroup)
[07:21:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:31:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:38:31] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed -
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:46:13] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:57:37] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:59:13] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:06] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T0800).
[08:00:06] chlod and Sohom_Datta: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:20] o/ here
[08:01:18] SRE, ops-eqiad, Cloud-VPS, cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (ayounsi) Nice ! So next step here is to decom the current ones and then sync up with DCops to move them to the proper racks. From there we can re-pr...
[08:01:51] o/
[08:01:59] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:07:47] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:08:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:09:35] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:18] (PS1) Muehlenhoff: Set Puppet 7 per role [puppet] - https://gerrit.wikimedia.org/r/983669 (https://phabricator.wikimedia.org/T349619)
[08:11:59] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:14:02] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to
puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:14:13] (SystemdUnitFailed) firing: (2) user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:29] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: user@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:31] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[08:19:13] (SystemdUnitFailed) firing: (2) user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:22:04] (CR) Filippo Giunchedi: [C: -1] "I promptly forgot to send this draft comment" [puppet] - https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: Herron)
[08:26:59] (SystemdUnitFailed) resolved: (2) user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:28:38] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/983481 (owner: Dzahn)
[08:28:57] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:13] (SystemdUnitFailed) firing: (2) user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:21] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:14] (SystemdUnitFailed) resolved: (2) user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:34:59] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:35:12] (CR) Muehlenhoff: [C: +2] librenms: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/983420 (owner: Muehlenhoff)
[08:35:33] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:36:33] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: user@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:13] (SystemdUnitFailed) firing: (2) user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:41:17] (PS1) Muehlenhoff: base::cuminunpriv: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/983671
[08:42:27] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:44:13] (SystemdUnitFailed) resolved: (2) user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:46:11] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983671 (owner: Muehlenhoff)
[08:47:30] (CR) Muehlenhoff: [C: +1] "Looks good" [software/debmonitor-client] - https://gerrit.wikimedia.org/r/981463 (owner: Slyngshede)
[08:49:11] (CR) Muehlenhoff: [C: -1] "Per https://office.wikimedia.org/wiki/Contact_list#Collaboration_Services the team is now called Collaboration Services, not Collaborative" [puppet] - https://gerrit.wikimedia.org/r/983485 (owner: Dzahn)
[08:52:39] (CR) Muehlenhoff: [C: +2] prometheus::snmp_exporter: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/983425 (owner: Muehlenhoff)
[08:54:07] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:54:37] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:13] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:59:57] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:02:23] (CR) Slyngshede: [C: +2] C:puppetmaster::monitoring Add Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/983131 (https://phabricator.wikimedia.org/T350694)
(owner: Slyngshede)
[09:03:53] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:05:02] (PS1) Vgutierrez: hiera: Add acmechief1002 to the list of acme-chief passive hosts [puppet] - https://gerrit.wikimedia.org/r/983675 (https://phabricator.wikimedia.org/T352242)
[09:07:11] (CR) Sohom Datta: [C: +1] Revert "util.main: Don't use mw.Map(), use a native Map() instead" [extensions/PageTriage] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/983529 (https://phabricator.wikimedia.org/T353571) (owner: Chlod Alejandro)
[09:07:33] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/940/con" [puppet] - https://gerrit.wikimedia.org/r/983675 (https://phabricator.wikimedia.org/T352242) (owner: Vgutierrez)
[09:08:26] (CR) Brouberol: [C: +2] yarn: proxy the spark job history requests to the spark history service [puppet] - https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) (owner: Brouberol)
[09:08:30] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:09:00] err I guess I'll take care of that too :)
[09:10:05] !log vgutierrez@acmechief1002:~$ sudo -i keyholder arm - T352242
[09:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:10] T352242: Provide second acmechief server configured for Puppet 7 in eqiad - https://phabricator.wikimedia.org/T352242
[09:11:55] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:11:59] (CR) Vgutierrez: [V: +1 C: +2] hiera: Add acmechief1002 to the list of
acme-chief passive hosts [puppet] - https://gerrit.wikimedia.org/r/983675 (https://phabricator.wikimedia.org/T352242) (owner: Vgutierrez)
[09:15:03] RECOVERY - Ensure that passive node gets the certificates from the active node as expected on acmechief1002 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.status is 22 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[09:15:11] moritzm: ^^
[09:16:27] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:29] (CR) Strainu: [namespaces] Use correct diacritics in Romanian (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) (owner: Strainu)
[09:17:09] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:20:35] (CR) Btullis: [C: +1] yarn: proxy the spark job history requests to the spark history service [puppet] - https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) (owner: Brouberol)
[09:21:39] (CR) Btullis: [V: +1 C: +2] Update the refinery version used by the refine production jobs [puppet] - https://gerrit.wikimedia.org/r/980923 (https://phabricator.wikimedia.org/T349121) (owner: Btullis)
[09:21:54] (ProbeDown) firing: (6) Service puppetmaster1002:8141 has failed probes (http_puppetmaster1002_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:25:10] vgutierrez: cheers
[09:28:55] (PS8) Brouberol: yarn: proxy the spark job history requests to the spark history service [puppet] -
https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863)
[09:28:57] (PS7) Brouberol: spark3: set the spark history server domain for analytics-hadoop [puppet] - https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863)
[09:35:23] (Abandoned) Muehlenhoff: superset: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/979368 (owner: Muehlenhoff)
[09:41:48] (CR) JMeybohm: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) (owner: Elukey)
[09:45:29] !log make eqiad-codfw 100G link primary
[09:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:06] (CR) Muehlenhoff: [C: +2] tlsproxy::envoy: Only pass an srange if not an empty array [puppet] - https://gerrit.wikimedia.org/r/982428 (owner: Muehlenhoff)
[09:51:22] (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/941/con" [puppet] - https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) (owner: Brouberol)
[09:51:54] (ProbeDown) firing: (2) Service puppetmaster2003:8141 has failed probes (http_puppetmaster2003_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:52:51] (PS3) Muehlenhoff: testreduce: Configure tlsproxy::envoy::firewall_srange [puppet] - https://gerrit.wikimedia.org/r/982419
[09:52:56] (PS1) Slyngshede: C:puppetmaster::monitoring use FQDN for blackbox [puppet] - https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694)
[09:53:12] (PS8) Brouberol: spark3: set the spark history server domain for analytics-hadoop [puppet] -
https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863)
[09:54:55] (CR) JMeybohm: [C: +1] "love it! 😜" [deployment-charts] - https://gerrit.wikimedia.org/r/983451 (https://phabricator.wikimedia.org/T350784) (owner: DCausse)
[09:55:13] (PS9) Brouberol: spark3: set the spark history server domain for analytics-hadoop [puppet] - https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863)
[09:56:46] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/982419 (owner: Muehlenhoff)
[09:56:57] (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/942/con" [puppet] - https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: Brouberol)
[09:57:34] (CR) Brouberol: [V: +1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: Brouberol)
[09:58:01] (PS3) Muehlenhoff: rsync::server: Add support for creating nftables-compatible firewall services [puppet] - https://gerrit.wikimedia.org/r/983437
[09:59:13] (PS2) Slyngshede: C:puppetmaster::monitoring use FQDN for blackbox [puppet] - https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694)
[09:59:53] (CR) CI reject: [V: -1] C:puppetmaster::monitoring use FQDN for blackbox [puppet] - https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:00:11] (PS3) Slyngshede: C:puppetmaster::monitoring use FQDN for blackbox [puppet] - https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694)
[10:00:44] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983437 (owner: Muehlenhoff)
[10:00:54] (CR) CI reject: [V: -1] C:puppetmaster::monitoring use FQDN for blackbox
[puppet] - https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:01:23] (PS4) Slyngshede: C:puppetmaster::monitoring use FQDN for blackbox [puppet] - https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694)
[10:02:04] (CR) JMeybohm: [V: +1] pki::multirootca: Override the server profiles expiry for k8s staging (1 comment) [puppet] - https://gerrit.wikimedia.org/r/983363 (https://phabricator.wikimedia.org/T353314) (owner: JMeybohm)
[10:02:06] (CR) CI reject: [V: -1] C:puppetmaster::monitoring use FQDN for blackbox [puppet] - https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:02:17] (PS10) Brouberol: spark3: set the spark history server domain for analytics-hadoop [puppet] - https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863)
[10:03:33] (PS5) Slyngshede: C:puppetmaster::monitoring use FQDN for blackbox [puppet] - https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694)
[10:04:07] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: Brouberol)
[10:04:23] (CR) CI reject: [V: -1] C:puppetmaster::monitoring use FQDN for blackbox [puppet] - https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:07:17] (CR) Ayounsi: [C: +2] Remove load-balancing VRRP master pinning [homer/public] - https://gerrit.wikimedia.org/r/983143 (https://phabricator.wikimedia.org/T307551) (owner: Ayounsi)
[10:08:10] (Merged) jenkins-bot: Remove load-balancing VRRP master pinning [homer/public] - https://gerrit.wikimedia.org/r/983143 (https://phabricator.wikimedia.org/T307551) (owner: Ayounsi)
[10:09:25] !log installing Linux 6.1.67 updates on Bookworm hosts
[10:09:27] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:59] !log remove VRRP pinning on cr1-eqiad/cr2-eqiad/cr2-codfw [10:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:29] (03Abandoned) 10Brouberol: yarn: proxy the spark job history requests to the spark history service [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [10:13:36] (03Restored) 10Brouberol: yarn: proxy the spark job history requests to the spark history service [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [10:14:51] (03CR) 10Btullis: [C: 03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [10:15:05] (03CR) 10Brouberol: [C: 03+2] spark3: set the spark history server domain for analytics-hadoop [puppet] - 10https://gerrit.wikimedia.org/r/983192 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [10:17:42] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [10:24:16] (03PS6) 10Slyngshede: C:puppetmaster::monitoring use FQDN for blackbox [puppet] - 10https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694) [10:25:38] (03CR) 10Slyngshede: [C: 03+2] C:puppetmaster::monitoring use FQDN for blackbox [puppet] - 10https://gerrit.wikimedia.org/r/983677 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:26:23] (03CR) 10Volans: [C: 03+2] reports: network, remove rdb from no IPv6 list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/979399 (https://phabricator.wikimedia.org/T271142) (owner: 10Volans) [10:27:01] (03Merged) 10jenkins-bot: reports: network, remove rdb from no IPv6 list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/979399 (https://phabricator.wikimedia.org/T271142) (owner: 10Volans) [10:29:47] !log 
volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:29:53] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:33:59] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetmaster2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:39:50] !log installing jetty9 security updates [10:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:10] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [10:47:48] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [10:51:09] (03PS1) 10Slyngshede: C:puppetmaster::monitoring remove blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/983683 (https://phabricator.wikimedia.org/T350694) [10:51:55] (03CR) 10Slyngshede: [C: 03+2] C:puppetmaster::monitoring remove blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/983683 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:52:16] !log installing gnutls28 security updates [10:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:24] (03CR) 10Volans: [C: 04-1] "I think there are some things to fix. See inline for the details." 
[puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:53:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:56:04] !log restarting apache/FPM on mw canaries to pick up gnutls update [10:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:51] slyngs: are you already fixing the puppet alerts on puppemaster1001 ? [10:56:58] Yes [10:57:02] ack thx [10:57:11] Already fixed on 1001 [10:57:20] Just running an update on 2001 now [10:57:27] ok [10:59:25] Both are happy again, sorry [11:00:06] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T1100) [11:04:00] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:04:06] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.6.5 [software/homer] - 10https://gerrit.wikimedia.org/r/983398 (owner: 10Ayounsi) [11:06:25] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.6.5 [software/homer] - 10https://gerrit.wikimedia.org/r/983398 (owner: 10Ayounsi) [11:16:59] (03PS4) 10Hnowlan: rest-gateway: add varnish- and trafficserver-side mangling to rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/949846 (https://phabricator.wikimedia.org/T344358) [11:18:59] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:21:54] (ProbeDown) resolved: (2) Service puppetmaster2003:8141 has failed probes (http_puppetmaster2003_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:00] (03CR) 10Elukey: [C: 03+2] recommendation-api: update monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) (owner: 10Elukey) [11:24:12] (03PS1) 10Muehlenhoff: failoid: Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/983687 [11:24:22] (03PS2) 10Muehlenhoff: failoid: Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/983687 [11:24:53] (03CR) 10CI reject: [V: 04-1] failoid: Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/983687 (owner: 10Muehlenhoff) [11:25:02] (03Merged) 10jenkins-bot: recommendation-api: update monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) (owner: 10Elukey) [11:33:35] (03PS1) 10Muehlenhoff: check_wmf_styleguide: Remove check to enforce presence of system::role [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/983689 [11:36:03] !log fabfur@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS bullseye [11:36:37] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [11:37:04] !log fabfur@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS bullseye [11:37:24] (03CR) 10Elukey: [C: 03+2] services: deploy the new rec-api-ng Docker image in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/983404 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [11:38:02] !log fabfur@cumin1002 
START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [11:38:31] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:39:31] !log installing qemu security updates on bookworm [11:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:15] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync [11:41:32] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [11:46:09] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [11:50:22] (03CR) 10Kamila Součková: [C: 03+2] kube-state-metrics: DRY network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [11:50:55] (03CR) 10Volans: "Just some early comments, I still need to do a full pass." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) (owner: 10Cathal Mooney) [11:51:19] !log fabfur@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS bullseye [11:51:31] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [11:52:29] (03CR) 10Majavah: [C: 03+1] [toolsdb] Use jemalloc to prevent memory issues [puppet] - 10https://gerrit.wikimedia.org/r/983513 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [11:53:32] (03Merged) 10jenkins-bot: kube-state-metrics: DRY network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [11:58:49] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:59:17] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [11:59:26] !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:01:24] !log installing ncurses security updates [12:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:51] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:04:24] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[12:04:42] (03PS1) 10Majavah: P:toolforge::prometheus: add recording rules for k8s cluster capacity [puppet] - 10https://gerrit.wikimedia.org/r/983692 (https://phabricator.wikimedia.org/T352581) [12:06:08] (03PS2) 10Majavah: P:toolforge::prometheus: add recording rules for k8s cluster capacity [puppet] - 10https://gerrit.wikimedia.org/r/983692 (https://phabricator.wikimedia.org/T352581) [12:07:10] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/944/con" [puppet] - 10https://gerrit.wikimedia.org/r/983692 (https://phabricator.wikimedia.org/T352581) (owner: 10Majavah) [12:08:01] (03PS1) 10Elukey: services: update Docker image and settings for Recommendation API [deployment-charts] - 10https://gerrit.wikimedia.org/r/983694 (https://phabricator.wikimedia.org/T349118) [12:09:59] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:10:35] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:11:16] (03PS5) 10Cathal Mooney: Refactor server provision script to select params based on profile [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) [12:12:46] !log restart swift-proxy and envoyproxy on ms-fe1012 [12:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:36] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:14:12] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:17:12] (03CR) 10Kamila Součková: [C: 03+1] [aux-k8s-eqiad] add kube-state-metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) (owner: 10CDanis) [12:18:29] !log kamila@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[12:18:31] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [12:19:16] !log kamila@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:19:49] !log fabfur@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS bullseye [12:19:49] (03PS1) 10MVernon: roll-restart-reboot-swift-ms-proxies: restart envoyproxy not nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/983699 [12:20:15] !log kamila@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:20:18] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [12:20:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] services: update Docker image and settings for Recommendation API [deployment-charts] - 10https://gerrit.wikimedia.org/r/983694 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [12:20:52] !log kamila@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:23:47] !log kamila@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:24:28] !log kamila@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:25:53] !log kamila@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:26:31] !log kamila@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[12:26:52] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-canary [12:27:00] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/983236 [12:27:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-canary [12:28:50] (03CR) 10MVernon: "Tested with:" [cookbooks] - 10https://gerrit.wikimedia.org/r/983699 (owner: 10MVernon) [12:32:04] (03CR) 10Arnaudb: [C: 03+1] roll-restart-reboot-swift-ms-proxies: restart envoyproxy not nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/983699 (owner: 10MVernon) [12:33:55] (03CR) 10MVernon: [C: 03+2] roll-restart-reboot-swift-ms-proxies: restart envoyproxy not nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/983699 (owner: 10MVernon) [12:39:42] (03PS6) 10Cathal Mooney: Refactor server provision script to select params based on profile [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) [12:41:34] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [12:42:43] (03CR) 10Majavah: "This feels like a dangerous way to handle things: if you're for example aliasing the firewall_srange parameter to an another key with a li" [puppet] - 10https://gerrit.wikimedia.org/r/982428 (owner: 10Muehlenhoff) [12:43:14] (03CR) 10Kamila Součková: [C: 03+2] mobileapps: 75% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976223 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [12:43:35] (03PS7) 10Cathal Mooney: Refactor server provision script to select params based on profile [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) [12:45:03] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [12:48:03] (03PS1) 10Brouberol: yarn: configure Apache to only listen to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/983712 (https://phabricator.wikimedia.org/T352863) [12:49:00] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983712 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [12:50:15] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:50:38] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:51:38] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:51:50] (03CR) 10Cathal Mooney: "Thanks for the review! Yeah I'm sure there will be improvements. I don't like having the big dict in the file either but found it easier" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) (owner: 10Cathal Mooney) [12:52:31] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:52:56] (03PS1) 10Slyngshede: C:puppetmaster::monitoring Blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/983713 (https://phabricator.wikimedia.org/T350694) [12:53:31] (03PS8) 10Cathal Mooney: Refactor server provision script to select params based on profile [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) [12:55:31] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:56:14] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:56:37] (03PS2) 10Kamila Součková: mobileapps: 90% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976224 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [13:06:57] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:08:07] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4037.ulsfo.wmnet with OS bullseye [13:08:38] (03CR) 10LSobanski: [C: 03+1] roles/hieradata: rename serviceops-collab team [puppet] - 10https://gerrit.wikimedia.org/r/983481 (owner: 10Dzahn) [13:15:36] !log installing intel-microcode security updates on buster hosts [13:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:35] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/983671 (owner: 10Muehlenhoff) [13:22:29] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/946/console" [puppet] - 10https://gerrit.wikimedia.org/r/983713 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:23:30] (03CR) 10Muehlenhoff: [C: 03+2] base::cuminunpriv: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/983671 (owner: 10Muehlenhoff) [13:24:36] (03PS1) 10Ayounsi: Netbox: use standard STORAGE_BACKEND/CONFIG keys [puppet] - 10https://gerrit.wikimedia.org/r/983716 (https://phabricator.wikimedia.org/T310717) [13:29:48] (03PS2) 10Slyngshede: C:puppetmaster::monitoring Blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/983713 (https://phabricator.wikimedia.org/T350694) [13:31:22] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/949/console" [puppet] - 10https://gerrit.wikimedia.org/r/983713 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:56:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 
is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:58:27] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T1400). nyaa~ [14:00:04] Dreamy_Jazz, cwhite, milkydefer, aiko, and MPGuy2824: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] \o [14:01:28] \o [14:01:38] \o [14:01:59] o/ [14:02:27] (03PS8) 10Brouberol: yarn: configure Apache to only listen to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/983712 (https://phabricator.wikimedia.org/T352863) [14:04:20] (03PS9) 10Brouberol: yarn: configure Apache to only listen to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/983712 (https://phabricator.wikimedia.org/T352863) [14:05:30] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/953/con" [puppet] - 10https://gerrit.wikimedia.org/r/983712 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:05:35] (03CR) 10Btullis: [C: 03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/983712 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:05:46] Any deployers around for this window? 
[14:07:41] (03CR) 10Brouberol: [V: 03+1 C: 03+2] yarn: configure Apache to only listen to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/983712 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:11:22] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/983713 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:13:01] !log installing node-undici security updates [14:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:16] (03CR) 10Volans: "LGTM, two small minor fixes and should be ready!" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [14:16:28] (03CR) 10Volans: [C: 03+1] "LGTM, didn't test it with the required dependency change" [puppet] - 10https://gerrit.wikimedia.org/r/983716 (https://phabricator.wikimedia.org/T310717) (owner: 10Ayounsi) [14:16:49] o/ same question [14:19:59] (03CR) 10Filippo Giunchedi: [C: 04-1] "Overall LGTM, haven't tried building the package though, please use gitlab going forward. With gitlab we can also use CI to build debian p" [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983468 (owner: 10Herron) [14:20:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime? [14:20:25] okay, let's deploy, since no one else's here [14:20:35] Thanks! 
[14:20:37] (03PS2) 10Urbanecm: CheckUser: Enable read new for event tables migration everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983178 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:20:40] (03CR) 10Urbanecm: [C: 03+2] CheckUser: Enable read new for event tables migration everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983178 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:20:45] Apologies if the second ping was annoying :) [14:20:53] no worries [14:21:02] (03CR) 10Filippo Giunchedi: [C: 04-1] "Same as verlib2, LGTM overall and I haven't tried building the package, please use gitlab" [debs/python-grafana-client] - 10https://gerrit.wikimedia.org/r/983477 (owner: 10Herron) [14:21:14] I will be able to test this change and shouldn't take too long to do so. [14:21:22] ack [14:21:47] (03Merged) 10jenkins-bot: CheckUser: Enable read new for event tables migration everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983178 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:23:04] aiko: hi! i see you have already +2'ed your change. please note that a +2 for operations/mediawiki-config changes means "I am going to personally deploy this right now". +2'ing a config change without actually doing the deployment generally confuses other deployers :). letting you know for your awareness. [14:23:41] (in this case, i'm going to do the deployment for you as the window deployer) [14:23:58] urbanecm: sorry about that! 
I didn't know it [14:23:59] (03CR) 10Urbanecm: [C: 03+2] Revert "util.main: Don't use mw.Map(), use a native Map() instead" [extensions/PageTriage] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983529 (https://phabricator.wikimedia.org/T353571) (owner: 10Chlod Alejandro) [14:24:03] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:982873|Add a testing stream for page-prediction-change events (T349919)]], [[gerrit:983178|CheckUser: Enable read new for event tables migration everywhere (T341829)]] [14:24:09] T349919: Apply common settings to publish events from Lift Wing staging to EventGate - https://phabricator.wikimedia.org/T349919 [14:24:09] T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829 [14:24:42] aiko: it's a major difference between +2ing a config patch (or a deployment branch patch, ie. a patch for `wmf/*`) versus +2'ing a code patch :) [14:24:45] 10SRE: Decommission lists1003 - https://phabricator.wikimedia.org/T353647 (10MoritzMuehlenhoff) [14:24:53] (03CR) 10Andrew Bogott: [C: 03+1] "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/983513 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [14:24:54] 10SRE: Decommission lists1003 - https://phabricator.wikimedia.org/T353647 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:25:04] 10SRE, 10Infrastructure-Foundations: Migrate Spicerack logs from cumin1001 to cumin1002? - https://phabricator.wikimedia.org/T353523 (10Volans) Yes, it would be nice to sync `/var/log/spicerack/` and `/var/log/cumin` from `cumin1001` to `cumin1002` when we stop using cumin1001. It's ok to have them in some `/v... [14:25:48] aiko: recommend https://wikitech.wikimedia.org/wiki/Deployments/Training in case you want to learn more about how deploying works (technically, you currently have permissions to do the deployment, too). 
[14:26:12] (03PS3) 10Urbanecm: Enable action blocks for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: 10MilkyDefer) [14:26:16] (03CR) 10Urbanecm: [C: 03+2] Enable action blocks for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: 10MilkyDefer) [14:26:17] urbanecm: ack, thank you! I will check it out [14:26:38] (03PS2) 10Ayounsi: Extend STORAGE_BACKEND config to support swift [software/netbox] - 10https://gerrit.wikimedia.org/r/980908 (https://phabricator.wikimedia.org/T310717) [14:26:48] np [14:27:16] (03Merged) 10jenkins-bot: Enable action blocks for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: 10MilkyDefer) [14:28:27] um, what am I supposed to do now? [14:29:30] milkydefer: please wait for a ping from me to test your change in the debug environment :). not sure if you went through that process before – if not, you'll need the Wikimedia debug browser extension installed (https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage). [14:30:41] indeed that's my first patch. I have installed that extension beforehand. [14:31:44] milkydefer: okay, great. i'll ping you once the patch is testable, and if you have any questions, please do not hesitate to ask. 
[14:32:38] (03CR) 10Elukey: [C: 03+2] services: update Docker image and settings for Recommendation API [deployment-charts] - 10https://gerrit.wikimedia.org/r/983694 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [14:32:43] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Consider deprecation of WMF styleguide checks - https://phabricator.wikimedia.org/T353648 (10MoritzMuehlenhoff) [14:32:50] (03CR) 10Andrew Bogott: [C: 03+1] P:toolforge::prometheus: add recording rules for k8s cluster capacity [puppet] - 10https://gerrit.wikimedia.org/r/983692 (https://phabricator.wikimedia.org/T352581) (owner: 10Majavah) [14:33:55] (03CR) 10Bking: [C: 03+2] wdqs: New LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/983438 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [14:34:15] !log urbanecm@deploy2002 dreamyjazz and aikochou and urbanecm: Backport for [[gerrit:982873|Add a testing stream for page-prediction-change events (T349919)]], [[gerrit:983178|CheckUser: Enable read new for event tables migration everywhere (T341829)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:34:21] T349919: Apply common settings to publish events from Lift Wing staging to EventGate - https://phabricator.wikimedia.org/T349919 [14:34:21] T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829 [14:34:25] Testing now. [14:34:31] Dreamy_Jazz: aiko: can you please test your changes at mwdebug? [14:35:47] (03CR) 10FNegri: [V: 03+1 C: 03+2] [toolsdb] Use jemalloc to prevent memory issues [puppet] - 10https://gerrit.wikimedia.org/r/983513 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [14:35:51] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [14:36:21] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [14:36:30] Nearly complete. Just need to check logstash. 
[14:36:34] ack [14:36:45] aiko: how are your tests going please? [14:36:54] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:06] Test successful. [14:37:20] ack, thanks. [14:37:21] urbanecm: we can proceed with the change [14:37:24] thanks, syncing. [14:37:25] !log urbanecm@deploy2002 dreamyjazz and aikochou and urbanecm: Continuing with sync [14:37:55] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) (owner: 10JMeybohm) [14:38:29] (03PS1) 10Btullis: Add superset namespaces to the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/983718 (https://phabricator.wikimedia.org/T347710) [14:40:24] (03CR) 10Btullis: "I have added these namespace with the default settings, which meant that istio-injection is enabled and so is deployTLSCertificate." [deployment-charts] - 10https://gerrit.wikimedia.org/r/983718 (https://phabricator.wikimedia.org/T347710) (owner: 10Btullis) [14:42:31] (03CR) 10Elukey: Add more calico alerts (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) (owner: 10JMeybohm) [14:43:03] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:982873|Add a testing stream for page-prediction-change events (T349919)]], [[gerrit:983178|CheckUser: Enable read new for event tables migration everywhere (T341829)]] (duration: 19m 00s) [14:43:09] T349919: Apply common settings to publish events from Lift Wing staging to EventGate - https://phabricator.wikimedia.org/T349919 [14:43:09] T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829 [14:43:10] Dreamy_Jazz: aiko: your changes are synced [14:43:24] Thanks! 
[14:43:29] thanks :) [14:43:32] (03Merged) 10jenkins-bot: Revert "util.main: Don't use mw.Map(), use a native Map() instead" [extensions/PageTriage] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983529 (https://phabricator.wikimedia.org/T353571) (owner: 10Chlod Alejandro) [14:43:37] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:981714|Enable action blocks for zhwiki (T353120)]] [14:43:41] T353120: Enable action blocks in Chinese Wikipedia - https://phabricator.wikimedia.org/T353120 [14:44:13] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:toolforge::prometheus: add recording rules for k8s cluster capacity [puppet] - 10https://gerrit.wikimedia.org/r/983692 (https://phabricator.wikimedia.org/T352581) (owner: 10Majavah) [14:44:14] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@d275e4f]: (no justification provided) [14:44:46] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@d275e4f]: (no justification provided) (duration: 00m 32s) [14:44:51] !log urbanecm@deploy2002 milkydefer and urbanecm: Backport for [[gerrit:981714|Enable action blocks for zhwiki (T353120)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:45:03] !log installing nagios-plugins-contrib bugfix updates [14:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:09] !log installing nagios-plugins-contrib bugfix updates from Bookworm point release [14:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:13] milkydefer: can you verify your patch at mwdebug2001, please? [14:45:33] you should only need to enable the Wikimedia debug extension, and trying to see whether action blocks work as you'd expect. 
[14:45:59] I am not an admin, I don't think I can actually perform the block [14:46:20] u could use apisandbox to check [14:46:34] or i can check myself [14:46:58] action blocks seem to be working [14:47:00] !log urbanecm@deploy2002 milkydefer and urbanecm: Continuing with sync [14:47:01] proceeding [14:47:20] yes, I checked apisandbox and that appears [14:47:37] (03CR) 10Bartosz Dziewoński: [C: 03+1] "I can run the maintenance script once deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983440 (owner: 10Esanders) [14:49:17] milkydefer: great. should be soon deployed in production [14:50:02] cwhite: regarding your patch, do you want to deploy for yourself, or should I work on it as well? [14:51:51] urbanecm: Please send it, I haven't received deployment training [14:52:07] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for mcastro-wmf - https://phabricator.wikimedia.org/T353273 (10herron) 05Open→03Stalled [14:52:11] will do [14:52:18] (03PS2) 10Urbanecm: Configure and enable StatsLib for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983229 (https://phabricator.wikimedia.org/T343024) (owner: 10Cwhite) [14:52:21] (03CR) 10Urbanecm: [C: 03+2] Configure and enable StatsLib for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983229 (https://phabricator.wikimedia.org/T343024) (owner: 10Cwhite) [14:52:36] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:981714|Enable action blocks for zhwiki (T353120)]] (duration: 08m 58s) [14:52:40] T353120: Enable action blocks in Chinese Wikipedia - https://phabricator.wikimedia.org/T353120 [14:52:42] milkydefer: your patch's live [14:52:47] 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for mcastro-wmf - https://phabricator.wikimedia.org/T353273 (10herron) [14:52:54] I'll check [14:52:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/983229 (https://phabricator.wikimedia.org/T343024) (owner: 10Cwhite) [14:53:15] confirmed [14:53:15] Thank you! [14:53:25] thanks [14:53:31] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:37] (03Merged) 10jenkins-bot: Configure and enable StatsLib for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983229 (https://phabricator.wikimedia.org/T343024) (owner: 10Cwhite) [14:53:54] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:983229|Configure and enable StatsLib for production (T343024)]], [[gerrit:983529|Revert "util.main: Don't use mw.Map(), use a native Map() instead" (T353571 T353076)]] [14:54:02] T343024: Configure MediaWiki to use new StatsLib in production - https://phabricator.wikimedia.org/T343024 [14:54:02] T353571: CSD tagging broken, displays an error and doesn't write to user talk (this.map.exists is not a function) - https://phabricator.wikimedia.org/T353571 [14:54:03] T353076: Deprecate and then drop mw.Map, obviated now we require ES6 - https://phabricator.wikimedia.org/T353076 [14:55:12] !log urbanecm@deploy2002 cwhite and urbanecm and chlod: Backport for [[gerrit:983229|Configure and enable StatsLib for production (T343024)]], [[gerrit:983529|Revert "util.main: Don't use mw.Map(), use a native Map() instead" (T353571 T353076)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:55:17] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:50] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10herron) 05Stalled→03Invalid Hello! 
Grooming the backlog today. Given that we've been in a holding pattern on this for some time I'll temporarily close as 'invalid' (since... [14:58:00] cwhite: MPGuy2824: can you test your patches at a mwdebug server, please? [14:58:13] (actually for cwhite's, probably not really possible, since it removes an isDebug condition) [14:58:27] mine's good to go. /me monitors logstash [14:58:34] ack [14:58:36] urbanecm: tested, all good [14:58:40] thanks, proceeding [14:58:41] !log urbanecm@deploy2002 cwhite and urbanecm and chlod: Continuing with sync [14:59:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:59:48] (03PS1) 10Btullis: Add kubeadm files for superset namespaces [puppet] - 10https://gerrit.wikimedia.org/r/983720 (https://phabricator.wikimedia.org/T347710) [15:01:20] (03PS2) 10Btullis: Add kubeadm files for superset namespaces [puppet] - 10https://gerrit.wikimedia.org/r/983720 (https://phabricator.wikimedia.org/T347710) [15:02:12] (03PS1) 10Muehlenhoff: mariadb::monitor_memory: Update package name [puppet] - 10https://gerrit.wikimedia.org/r/983721 [15:02:15] (03PS1) 10FNegri: [toolsdb] fix override syntax [puppet] - 10https://gerrit.wikimedia.org/r/983722 (https://phabricator.wikimedia.org/T353093) [15:02:28] (03PS2) 10Muehlenhoff: mariadb::monitor_memory: Update package name [puppet] - 10https://gerrit.wikimedia.org/r/983721 [15:02:50] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [15:04:15] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:983229|Configure and enable StatsLib for production (T343024)]], [[gerrit:983529|Revert "util.main: Don't use mw.Map(), use a native Map() instead" (T353571 T353076)]] (duration: 10m 20s) [15:04:27] cwhite: MPGuy2824: should be live 
[15:04:27] T343024: Configure MediaWiki to use new StatsLib in production - https://phabricator.wikimedia.org/T343024 [15:04:28] T353571: CSD tagging broken, displays an error and doesn't write to user talk (this.map.exists is not a function) - https://phabricator.wikimedia.org/T353571 [15:04:28] T353076: Deprecate and then drop mw.Map, obviated now we require ES6 - https://phabricator.wikimedia.org/T353076 [15:04:29] anything else? [15:04:31] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/954/con" [puppet] - 10https://gerrit.wikimedia.org/r/983720 (https://phabricator.wikimedia.org/T347710) (owner: 10Btullis) [15:04:52] look good from my pov - thanks again! [15:04:58] urbancm: Thank you [15:05:32] any time [15:08:01] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:09:31] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:09:35] (03CR) 10Andrew Bogott: [C: 03+1] [toolsdb] fix override syntax [puppet] - 10https://gerrit.wikimedia.org/r/983722 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [15:09:54] (03CR) 10Majavah: [C: 03+1] [toolsdb] fix override syntax [puppet] - 10https://gerrit.wikimedia.org/r/983722 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [15:10:04] (03PS1) 10Ladsgroup: Beta: Move the override of ores LW URL to CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983724 (https://phabricator.wikimedia.org/T348298) [15:10:23] (03CR) 10FNegri: [C: 03+2] [toolsdb] fix override syntax [puppet] - 10https://gerrit.wikimedia.org/r/983722 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [15:10:52] (03CR) 10Ilias Sarantopoulos: [C: 03+1] 
Beta: Move the override of ores LW URL to CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983724 (https://phabricator.wikimedia.org/T348298) (owner: 10Ladsgroup) [15:11:08] (03CR) 10Ladsgroup: [C: 03+2] Beta: Move the override of ores LW URL to CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983724 (https://phabricator.wikimedia.org/T348298) (owner: 10Ladsgroup) [15:11:57] (03Merged) 10jenkins-bot: Beta: Move the override of ores LW URL to CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983724 (https://phabricator.wikimedia.org/T348298) (owner: 10Ladsgroup) [15:12:37] 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10jhathaway) wolfssl is packaged in Debian, so that may be a possible option longer term, https://tracker.debian.org/pkg/wolfssl. [15:13:10] (03CR) 10Marostegui: [C: 03+1] mariadb::monitor_memory: Update package name [puppet] - 10https://gerrit.wikimedia.org/r/983721 (owner: 10Muehlenhoff) [15:13:21] (03PS1) 10Bking: wdqs: remove unused CNAME [dns] - 10https://gerrit.wikimedia.org/r/983725 (https://phabricator.wikimedia.org/T352111) [15:16:04] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp4037.ulsfo.wmnet [15:16:04] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp4037.ulsfo.wmnet [15:16:26] !log repooling cp4037 (T352876) [15:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:31] T352876: cp4037 reimage for cookbook getting stuck at PXE boot - https://phabricator.wikimedia.org/T352876 [15:16:37] (03PS3) 10Vgutierrez: traffic: Provide a dashboard link for LVSRealServerMSS [alerts] - 10https://gerrit.wikimedia.org/r/982808 (https://phabricator.wikimedia.org/T351069) [15:16:52] (03CR) 10Vgutierrez: "thanks for the review!" 
[alerts] - 10https://gerrit.wikimedia.org/r/982808 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:17:48] done [15:24:21] (03PS1) 10Ladsgroup: Revert "Revert "[beta] ores-extension: enable revertrisk model for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983734 (https://phabricator.wikimedia.org/T348298) [15:24:29] (03PS2) 10Ladsgroup: Revert "Revert "[beta] ores-extension: enable revertrisk model for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983734 (https://phabricator.wikimedia.org/T348298) [15:24:37] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "[beta] ores-extension: enable revertrisk model for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983734 (https://phabricator.wikimedia.org/T348298) (owner: 10Ladsgroup) [15:25:19] (03Merged) 10jenkins-bot: Revert "Revert "[beta] ores-extension: enable revertrisk model for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983734 (https://phabricator.wikimedia.org/T348298) (owner: 10Ladsgroup) [15:26:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "This is breaking longer term views for certain dashboards, for example https://grafana.wikimedia.org/d/pr6ZUm5nz/haproxy-cluster-view?forc" [puppet] - 10https://gerrit.wikimedia.org/r/979163 (owner: 10Herron) [15:26:37] (03PS1) 10FNegri: mariadb::service chmod override file [puppet] - 10https://gerrit.wikimedia.org/r/983746 [15:27:57] (03CR) 10Filippo Giunchedi: [C: 03+1] traffic: Provide a dashboard link for LVSRealServerMSS [alerts] - 10https://gerrit.wikimedia.org/r/982808 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:30:57] (03PS1) 10Herron: Revert "thanos-query: enable auto-downsampling" [puppet] - 10https://gerrit.wikimedia.org/r/983735 [15:31:12] (03PS1) 10Elukey: Revert "services: update Docker image and settings for Recommendation API" [deployment-charts] - 10https://gerrit.wikimedia.org/r/983736 [15:33:13] (03CR) 10Elukey: [C: 03+2] Revert "services: update 
Docker image and settings for Recommendation API" [deployment-charts] - 10https://gerrit.wikimedia.org/r/983736 (owner: 10Elukey) [15:34:37] (03CR) 10Herron: [C: 03+2] Revert "thanos-query: enable auto-downsampling" [puppet] - 10https://gerrit.wikimedia.org/r/983735 (owner: 10Herron) [15:35:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "To be clear, the issues are not with irate but rather with irate/rate period of 5m (or shorter)" [puppet] - 10https://gerrit.wikimedia.org/r/983735 (owner: 10Herron) [15:36:37] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [15:37:00] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [15:38:32] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:39:54] (03PS9) 10Cathal Mooney: Refactor server provision script to select params based on profile [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) [15:41:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2074-2080 to codfw - jhancock@cumin2002" [15:42:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2074-2080 to codfw - jhancock@cumin2002" [15:42:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:41] 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10MoritzMuehlenhoff) >>! In T352744#9413140, @jhathaway wrote: > wolfssl is packaged in Debian, so that may be a possible option longer term, https://tracker.debian.org/pkg/wolfssl. wolfssl isn't fully... 
[15:48:39] (03CR) 10Bking: [C: 03+2] rdf_streaming_updater: drop RdfStreamingUpdaterNotEnoughTaskSlots [alerts] - 10https://gerrit.wikimedia.org/r/983449 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [15:49:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2074.mgmt.codfw.wmnet with reboot policy FORCED [15:49:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2075.mgmt.codfw.wmnet with reboot policy FORCED [15:49:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2076.mgmt.codfw.wmnet with reboot policy FORCED [15:49:16] (03CR) 10Bking: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/983449 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [15:49:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2077.mgmt.codfw.wmnet with reboot policy FORCED [15:49:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with reboot policy FORCED [15:49:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2079.mgmt.codfw.wmnet with reboot policy FORCED [15:49:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2080.mgmt.codfw.wmnet with reboot policy FORCED [15:49:38] (03CR) 10Bking: [C: 03+2] rdf_streaming_updater: drop RdfStreamingUpdaterNotEnoughTaskSlots [alerts] - 10https://gerrit.wikimedia.org/r/983449 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [15:49:54] (03Merged) 10jenkins-bot: rdf_streaming_updater: drop RdfStreamingUpdaterNotEnoughTaskSlots [alerts] - 10https://gerrit.wikimedia.org/r/983449 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [15:51:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:53:30] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: switch to flink-app dashboard [alerts] - 10https://gerrit.wikimedia.org/r/983450 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [15:54:49] (03Merged) 10jenkins-bot: rdf-streaming-updater: switch to flink-app dashboard [alerts] - 10https://gerrit.wikimedia.org/r/983450 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [15:59:21] (03CR) 10Bking: [C: 03+1] charts: remove flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/983451 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [16:00:14] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, 10IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10akosiaris) >>! In T271142#9382040, @Volans wrote: > Another datapoint for the mw*/parse* clusters, they will... 
[16:01:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:01:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2074.mgmt.codfw.wmnet with reboot policy FORCED [16:01:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2075.mgmt.codfw.wmnet with reboot policy FORCED [16:01:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2077.mgmt.codfw.wmnet with reboot policy FORCED [16:01:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2078.mgmt.codfw.wmnet with reboot policy FORCED [16:02:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2076.mgmt.codfw.wmnet with reboot policy FORCED [16:02:13] (03PS1) 10Ilias Sarantopoulos: testwiki: enable revertris kmodel in ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983747 (https://phabricator.wikimedia.org/T348298) [16:05:54] (03PS2) 10Ilias Sarantopoulos: testwiki: enable revertrisk model in ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983747 (https://phabricator.wikimedia.org/T348298) [16:12:38] (03PS1) 10Brouberol: spark-history: enable definition of spark env vars in spark-env.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/983748 (https://phabricator.wikimedia.org/T352863) [16:12:40] (03PS1) 10Brouberol: spark-history: set public DNS to yarn.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/983749 (https://phabricator.wikimedia.org/T352863) [16:13:09] 10SRE-swift-storage, 
10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Vgutierrez) [16:14:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2079.mgmt.codfw.wmnet with reboot policy FORCED [16:16:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2080.mgmt.codfw.wmnet with reboot policy FORCED [16:16:22] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2074'] [16:16:29] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2075'] [16:16:44] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2077'] [16:16:53] (03CR) 10Kosta Harlan: [C: 03+1] testwiki: enable revertrisk model in ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983747 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [16:16:54] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078'] [16:17:00] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2079'] [16:17:05] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2079'] [16:17:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2074'] [16:17:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2075'] [16:18:16] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2076'] [16:18:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2076'] [16:18:33] 
(RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:18:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2077'] [16:18:58] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2078'] [16:20:09] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2080'] [16:20:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2079'] [16:20:26] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2079'] [16:20:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2080'] [16:20:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2079'] [16:21:19] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2074'] [16:21:49] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2075'] [16:21:56] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:22:14] !log 
jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2076'] [16:22:38] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2077'] [16:23:01] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078'] [16:23:27] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2079'] [16:23:50] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2080'] [16:23:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2080'] [16:25:19] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [16:25:32] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [16:28:01] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [16:28:11] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [16:28:40] these are roll restarts to pick up new schemas --^ [16:29:00] (03CR) 10Brouberol: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/983720 (https://phabricator.wikimedia.org/T347710) (owner: 10Btullis) [16:29:31] (03PS1) 10Filippo Giunchedi: thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) [16:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T1630). 
[16:30:47] (03PS1) 10Herron: admin: add rkhan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/983753 (https://phabricator.wikimedia.org/T353370) [16:31:10] (03PS2) 10Brouberol: spark-history: enable definition of spark env vars in spark-env.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/983748 (https://phabricator.wikimedia.org/T352863) [16:31:12] (03PS2) 10Brouberol: spark-history: set public DNS to yarn.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/983749 (https://phabricator.wikimedia.org/T352863) [16:31:31] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [16:31:40] (03CR) 10CI reject: [V: 04-1] admin: add rkhan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/983753 (https://phabricator.wikimedia.org/T353370) (owner: 10Herron) [16:32:23] (03CR) 10CI reject: [V: 04-1] thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [16:32:27] (03PS1) 10Sergio Gimeno: Temporary users: set notifyBeforeExpirationDays same as expireAfterDays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983755 (https://phabricator.wikimedia.org/T344694) [16:33:00] (03CR) 10Elukey: "Left a comment passing by :)" [puppet] - 10https://gerrit.wikimedia.org/r/983720 (https://phabricator.wikimedia.org/T347710) (owner: 10Btullis) [16:33:32] !log akosiaris@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Give AAAA and PTR records to mc2042-mc2055 - akosiaris@cumin1001" [16:33:42] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983756 (https://phabricator.wikimedia.org/T128546) [16:33:44] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1077 [16:33:46] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1077 [16:33:57] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2080 [16:34:23] (03PS2) 10Herron: admin: add rkhan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/983753 (https://phabricator.wikimedia.org/T353370) [16:34:25] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Give AAAA and PTR records to mc2042-mc2055 - akosiaris@cumin1001" [16:34:25] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:35:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2080 [16:35:15] (03CR) 10CI reject: [V: 04-1] admin: add rkhan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/983753 (https://phabricator.wikimedia.org/T353370) (owner: 10Herron) [16:35:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2080'] [16:35:28] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983756 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:35:54] 10SRE, 10Observability-Metrics, 10Goal, 10Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10elukey) Tried to deploy rec-api without the statsd exporter, all good but the metrics are still not 100% ok. From a quick look it seems that we define the new metric... 
[16:36:42] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983756 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:37:06] (03PS3) 10Herron: admin: add rkhan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/983753 (https://phabricator.wikimedia.org/T353370) [16:38:28] (03PS2) 10Filippo Giunchedi: thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) [16:38:30] (03CR) 10BCornwall: [C: 03+1] wdqs: remove unused CNAME [dns] - 10https://gerrit.wikimedia.org/r/983725 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [16:41:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2078'] [16:41:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2080'] [16:41:15] (03CR) 10CI reject: [V: 04-1] thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [16:41:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2077'] [16:41:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2079'] [16:41:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2074'] [16:41:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2075'] [16:41:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2076'] [16:44:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production 
access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10herron) [16:44:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10herron) Hi @odimitrijevic , @Milimetric -- could you please review/approve this user addition to the `analytics-pri... [16:45:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10herron) p:05Triage→03Medium [16:46:17] (03CR) 10Volans: [C: 03+1] "LGTM, make sure to check the affected cookbooks" [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [16:48:15] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, 10IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10akosiaris) Regarding the mc* hosts, I 've been mulling over this one for some time now trying to figure out t... 
[16:48:21] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10Jhancock.wm) [16:48:23] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:983756| Bumping portals to master (T128546)]] (duration: 06m 08s) [16:48:27] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:50:27] 10SRE-tools, 10Infrastructure-Foundations: Decommission cookbook: lock per switch - https://phabricator.wikimedia.org/T353513 (10Volans) If the delicate part is the call to `configure_switch_interfaces()` we can just change it's signature to require a `lock` [[ https://doc.wikimedia.org/spicerack/master/api/in... [16:51:02] (03CR) 10Ladsgroup: [C: 03+2] Add virtual domain for botpasswords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976787 (https://phabricator.wikimedia.org/T351559) (owner: 10Ladsgroup) [16:52:20] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [16:52:27] !log akosiaris@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [16:52:46] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [16:52:46] (03PS1) 10Ladsgroup: Change virtual domain of botpassword to plural [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983758 (https://phabricator.wikimedia.org/T351559) [16:54:51] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:983756| Bumping portals to master (T128546)]] (duration: 06m 28s) [16:54:52] (03PS1) 10Peter Fischer: Search update pipeline: enable commonwiki and wikidatawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/983759 [16:54:58] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:55:14] (03CR) 10DCausse: [C: 03+1] Search update pipeline: enable commonwiki and wikidatawiki 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/983759 (owner: 10Peter Fischer) [16:55:23] !log akosiaris@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Give AAAA and PTR records to mc-gp[12]00[123] - akosiaris@cumin1001" [16:55:44] (03PS2) 10Alexandros Kosiaris: tlsproxy::envoy: Allow specifying a percentage to be traced [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) [16:55:57] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [16:56:16] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Give AAAA and PTR records to mc-gp[12]00[123] - akosiaris@cumin1001" [16:56:16] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:01:17] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (10herron) [17:01:53] (03CR) 10DCausse: [C: 03+2] charts: remove flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/983451 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [17:02:45] (03Merged) 10jenkins-bot: charts: remove flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/983451 (https://phabricator.wikimedia.org/T350784) (owner: 10DCausse) [17:05:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2074.codfw.wmnet with OS bullseye [17:05:15] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 
for host ms-be2074.codfw.wmnet with OS bullseye [17:12:28] !log bking@kafka-jumbo1007 kafka topics --alter --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5 T351503 [17:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:36] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 [17:14:10] !log bking@kafka-jumbo1007 kafka topics --alter --topic codfw.mediawiki.cirrussearch.page_rerender.v1 --partitions 5 T351503 [17:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:58] (03PS1) 10Ahmon Dancy: cli.py: Improve help text for docker-pkg help 'select' argument [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/983763 [17:17:23] (03CR) 10CI reject: [V: 04-1] cli.py: Improve help text for docker-pkg help 'select' argument [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/983763 (owner: 10Ahmon Dancy) [17:19:27] (03PS2) 10Ahmon Dancy: cli.py: Improve help text for docker-pkg help 'select' argument [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/983763 [17:19:41] (03PS1) 10Dzahn: devtools/phabricator: remove comment about git-ssh vcs service [puppet] - 10https://gerrit.wikimedia.org/r/983871 (https://phabricator.wikimedia.org/T296022) [17:21:54] (03CR) 10Dzahn: [C: 03+2] "just a comment referencing deprecated service" [puppet] - 10https://gerrit.wikimedia.org/r/983871 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [17:22:37] (03CR) 10CI reject: [V: 04-1] cli.py: Improve help text for docker-pkg help 'select' argument [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/983763 (owner: 10Ahmon Dancy) [17:23:50] (03PS3) 10Ahmon Dancy: cli.py: Improve help text for docker-pkg help 'select' argument [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/983763 [17:27:37] (03CR) 10Ahmon Dancy: [C: 03+1] "Ready for review." 
[docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/983763 (owner: 10Ahmon Dancy) [17:33:13] (03PS1) 10Bking: wdqs: Enable ipv6 for envoy tls_terminator [puppet] - 10https://gerrit.wikimedia.org/r/983893 (https://phabricator.wikimedia.org/T347355) [17:34:00] (03PS2) 10Bking: wdqs: Enable ipv6 for envoy tls_terminator [puppet] - 10https://gerrit.wikimedia.org/r/983893 (https://phabricator.wikimedia.org/T347355) [17:35:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983893 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:35:26] (03PS2) 10Dzahn: planet: Set Puppet 7 per role [puppet] - 10https://gerrit.wikimedia.org/r/983669 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [17:35:58] (03Abandoned) 10Bking: wdqs: envoy TLS termination for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/544829 (https://phabricator.wikimedia.org/T210411) (owner: 10Mathew.onipe) [17:36:49] (03CR) 10Dzahn: "In this context, I wonder about https://phabricator.wikimedia.org/T255568 and if this is really that easy or that ticket is outdated..etc" [puppet] - 10https://gerrit.wikimedia.org/r/983893 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:38:12] (03CR) 10Dzahn: [C: 03+2] planet: Set Puppet 7 per role [puppet] - 10https://gerrit.wikimedia.org/r/983669 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [17:42:17] (03CR) 10Vgutierrez: [C: 03+2] traffic: Provide a dashboard link for LVSRealServerMSS [alerts] - 10https://gerrit.wikimedia.org/r/982808 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [17:42:33] (03CR) 10Dzahn: [C: 03+2] roles/hieradata: rename serviceops-collab team [puppet] - 10https://gerrit.wikimedia.org/r/983481 (owner: 10Dzahn) [17:47:31] (03PS3) 10Filippo Giunchedi: thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) [17:47:35] (03PS2) 10Dzahn: 
roles/hieradata: rename serviceops-collab team [puppet] - 10https://gerrit.wikimedia.org/r/983481 [17:48:49] (03CR) 10Bking: wdqs: Enable ipv6 for envoy tls_terminator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983893 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:51:02] (03CR) 10CI reject: [V: 04-1] thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [17:52:34] (03PS4) 10Filippo Giunchedi: thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) [17:53:01] (03CR) 10JHathaway: [C: 03+1] "looks good, thanks" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/983689 (owner: 10Muehlenhoff) [17:54:33] jouncebot: nowandnext [17:54:33] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [17:54:33] In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T1800) [17:54:34] In 0 hour(s) and 5 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T1800) [17:56:07] (03PS3) 10Alexandros Kosiaris: tlsproxy::envoy: Allow specifying a percentage to be traced [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) [17:56:09] (03PS1) 10Alexandros Kosiaris: mediawiki canaries: Include opentelemetry::collector [puppet] - 10https://gerrit.wikimedia.org/r/983895 (https://phabricator.wikimedia.org/T351566) [17:57:26] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T1800) [18:00:05] ryankemper: #bothumor When your 
hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T1800). [18:00:09] (03PS1) 10Herron: admin: add anakanishi to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/983896 (https://phabricator.wikimedia.org/T353363) [18:00:44] (03CR) 10Dzahn: [C: 03+2] roles/hieradata: rename serviceops-collab team [puppet] - 10https://gerrit.wikimedia.org/r/983481 (owner: 10Dzahn) [18:01:14] (03PS4) 10Bartosz Dziewoński: Update expected RunSingleJob.php status code [puppet] - 10https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) [18:01:16] (03PS1) 10Bartosz Dziewoński: Temporarily remove RunSingleJob.php status code check [puppet] - 10https://gerrit.wikimedia.org/r/983897 (https://phabricator.wikimedia.org/T352265) [18:01:23] (03PS4) 10Alexandros Kosiaris: tlsproxy::envoy: Allow specifying a percentage to be traced [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) [18:01:36] (03PS5) 10Bartosz Dziewoński: RunSingleJob.php: Fix use of MWExceptionHandler before it's defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) [18:01:44] (03CR) 10CI reject: [V: 04-1] Update expected RunSingleJob.php status code [puppet] - 10https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [18:01:57] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [18:02:04] (03CR) 10Bartosz Dziewoński: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [18:02:33] (03PS6) 10Bartosz Dziewoński: RunSingleJob.php: Fix use of MWExceptionHandler before it's defined [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) [18:03:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (10herron) Hello! A few approvals are needed here before proceeding @LMixter, could you please approve this reque... [18:04:45] (03CR) 10CI reject: [V: 04-1] Temporarily remove RunSingleJob.php status code check [puppet] - 10https://gerrit.wikimedia.org/r/983897 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [18:07:36] (03PS1) 10DLynch: Revert "Fix English Gboard backspace over aliens" [VisualEditor/VisualEditor] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983745 (https://phabricator.wikimedia.org/T353578) [18:07:50] (03PS1) 10DLynch: Revert "Put zero-width space after inline focusable nodes" [VisualEditor/VisualEditor] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983906 (https://phabricator.wikimedia.org/T353578) [18:21:36] (03PS2) 10Alexandros Kosiaris: mediawiki canaries: Include opentelemetry::collector [puppet] - 10https://gerrit.wikimedia.org/r/983895 (https://phabricator.wikimedia.org/T351566) [18:21:38] (03PS5) 10Alexandros Kosiaris: tlsproxy::envoy: Allow specifying a percentage to be traced [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) [18:22:51] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [18:24:48] (03PS2) 10Dzahn: site/cumin: rename insetup role for collaboration services [puppet] - 10https://gerrit.wikimedia.org/r/983485 [18:25:18] (03CR) 10CI reject: [V: 04-1] site/cumin: rename insetup role for collaboration services [puppet] - 10https://gerrit.wikimedia.org/r/983485 (owner: 10Dzahn) [18:25:23] (03CR) 10Dzahn: site/cumin: rename insetup role 
for collaboration services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983485 (owner: 10Dzahn) [18:25:35] (03PS6) 10Alexandros Kosiaris: services_proxy: Listen on :: and not ::1 [puppet] - 10https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) [18:28:08] (03CR) 10Alexandros Kosiaris: tlsproxy::envoy: Allow specifying a percentage to be traced (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [18:28:10] (03PS3) 10Dzahn: site/cumin: rename insetup role for collaboration services [puppet] - 10https://gerrit.wikimedia.org/r/983485 [18:31:39] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [18:31:51] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:05] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/983485 (owner: 10Dzahn) [18:40:01] (03CR) 10Muehlenhoff: check_wmf_styleguide: Remove check to enforce presence of system::role (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/983689 (owner: 10Muehlenhoff) [18:45:09] (03CR) 10Dzahn: [C: 03+2] site/cumin: rename insetup role for collaboration services [puppet] - 10https://gerrit.wikimedia.org/r/983485 (owner: 10Dzahn) [18:47:17] 10SRE-swift-storage, 10Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (10PantheraLeo1359531) Can be tracked what happened at the time when the affected were not able to be uploaded... 
[18:47:43] (03CR) 10RLazarus: [C: 03+1] "LGTM, pending the CI failure about the commit message -- once that's happy, let me know whenever you're ready and I'm happy to merge for y" [puppet] - 10https://gerrit.wikimedia.org/r/983897 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [18:48:29] (03CR) 10CDanis: [C: 03+1] tlsproxy::envoy: Allow specifying a percentage to be traced [puppet] - 10https://gerrit.wikimedia.org/r/983441 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [18:48:43] (03CR) 10CDanis: [C: 03+1] mediawiki canaries: Include opentelemetry::collector [puppet] - 10https://gerrit.wikimedia.org/r/983895 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [18:49:02] (03CR) 10Dzahn: [C: 03+2] "@Muehlenhoff: this reverted the puppet7 conversion since it was role based.. ooops" [puppet] - 10https://gerrit.wikimedia.org/r/983485 (owner: 10Dzahn) [18:52:22] (03CR) 10Bartosz Dziewoński: Temporarily remove RunSingleJob.php status code check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983897 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [18:52:31] (03PS1) 10Dzahn: hieradata: rename serviceops_collab role yaml file [puppet] - 10https://gerrit.wikimedia.org/r/983904 [18:52:33] (03PS2) 10Bartosz Dziewoński: Temporarily remove RunSingleJob.php status code check [puppet] - 10https://gerrit.wikimedia.org/r/983897 (https://phabricator.wikimedia.org/T352265) [18:52:49] (03CR) 10Dzahn: [C: 03+2] "was missing https://gerrit.wikimedia.org/r/c/operations/puppet/+/983904/" [puppet] - 10https://gerrit.wikimedia.org/r/983485 (owner: 10Dzahn) [18:56:09] (03PS1) 10Ottomata: WIP - add webrequest.frontend stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) [18:56:50] 10SRE, 10Observability-Metrics, 10Goal, 10Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 
(10colewhite) >>! In T205870#9413501, @elukey wrote: > Tried to deploy rec-api without the statsd exporter, all good but the metrics are still not 100% ok. From a quick... [18:57:10] (03CR) 10CI reject: [V: 04-1] WIP - add webrequest.frontend stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [18:58:32] (03CR) 10Dzahn: [C: 03+2] hieradata: rename serviceops_collab role yaml file [puppet] - 10https://gerrit.wikimedia.org/r/983904 (owner: 10Dzahn) [18:58:43] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [18:58:53] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:40] (03CR) 10Dzahn: [C: 03+2] "after this: Error: The CRL ... has expired, verify time is synchronized" [puppet] - 10https://gerrit.wikimedia.org/r/983904 (owner: 10Dzahn) [19:10:09] (03PS1) 10DDesouza: Undeploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983928 (https://phabricator.wikimedia.org/T344393) [19:10:57] (03CR) 10CI reject: [V: 04-1] Undeploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983928 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [19:18:39] (03PS2) 10DDesouza: Undeploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983928 (https://phabricator.wikimedia.org/T344393) [19:22:43] (03CR) 10Bking: "Can/should we enable this for Envoy's TLS terminator as well?" 
[puppet] - 10https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [19:23:29] (03CR) 10Dzahn: [C: 03+2] "manually edited puppet.conf to change it back to puppet7 and ran the agent and seems fine" [puppet] - 10https://gerrit.wikimedia.org/r/983904 (owner: 10Dzahn) [19:31:37] (03CR) 10Bartosz Dziewoński: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [19:31:42] (03PS5) 10Bartosz Dziewoński: Update expected RunSingleJob.php status code [puppet] - 10https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) [19:37:00] (03CR) 10Dzahn: "I left a comment on the ticket and that did trigger a response in the form of patches by Αλέξανδρος Κοσιάρης" [puppet] - 10https://gerrit.wikimedia.org/r/983893 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:38:32] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:40:40] (03CR) 10Hashar: [C: 03+1] cli.py: Improve help text for docker-pkg help 'select' argument [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/983763 (owner: 10Ahmon Dancy) [19:49:58] (03CR) 10Dzahn: [C: 03+1] rsync::server: Add support for creating nftables-compatible firewall services [puppet] - 10https://gerrit.wikimedia.org/r/983437 (owner: 10Muehlenhoff) [19:50:23] (03PS2) 10Ottomata: WIP - add webrequest.frontend stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) [19:51:02] (03CR) 10CI reject: [V: 04-1] WIP - add webrequest.frontend stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [19:51:54] (03PS3) 10Ottomata: WIP - add webrequest.frontend stream 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) [19:52:45] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/982419/956/testreduce1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/982419 (owner: 10Muehlenhoff) [19:54:54] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "ready to go if no concerns from you" [puppet] - 10https://gerrit.wikimedia.org/r/983491 (https://phabricator.wikimedia.org/T333510) (owner: 10Dzahn) [20:05:41] (03PS2) 10DLynch: Revert "Put zero-width space after inline focusable nodes" [VisualEditor/VisualEditor] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983906 (https://phabricator.wikimedia.org/T353578) [20:07:54] (03CR) 10Gmodena: [C: 03+2] mw-page-content-enrich: version bump. [deployment-charts] - 10https://gerrit.wikimedia.org/r/980359 (https://phabricator.wikimedia.org/T345806) (owner: 10Gmodena) [20:09:01] (03Merged) 10jenkins-bot: mw-page-content-enrich: version bump. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/980359 (https://phabricator.wikimedia.org/T345806) (owner: 10Gmodena) [20:19:31] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:19:37] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:22:00] (03CR) 10RLazarus: [C: 03+2] Temporarily remove RunSingleJob.php status code check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983897 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [20:30:09] MatmaRex: merged, deployed, and the tests are still passing as expected 👍 [20:31:38] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:31:45] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:36:30] rzl: oh, thanks [20:37:48] (03PS1) 10Ottomata: wgEventStreams - Add message_key_fields to page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983939 (https://phabricator.wikimedia.org/T338231) [20:38:03] no worries! 
thanks again for keeping the tests up-to-date [20:38:50] (03CR) 10Gmodena: [C: 03+1] wgEventStreams - Add message_key_fields to page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983939 (https://phabricator.wikimedia.org/T338231) (owner: 10Ottomata) [20:39:01] (03CR) 10Ottomata: [C: 03+2] wgEventStreams - Add message_key_fields to page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983939 (https://phabricator.wikimedia.org/T338231) (owner: 10Ottomata) [20:39:42] (03Merged) 10jenkins-bot: wgEventStreams - Add message_key_fields to page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983939 (https://phabricator.wikimedia.org/T338231) (owner: 10Ottomata) [20:45:05] (03CR) 10Jdlrobson: "Do you need help deploying this Pols12 or do you plan to do that this week?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [20:48:26] !log otto@deploy2002 Synchronized wmf-config/ext-EventStreamConfig.php: Config: [[gerrit:983939|Add message_key_fields to page_content_change stream (T338231)]] (duration: 06m 32s) [20:48:42] T338231: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 [20:52:13] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:52:21] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:52:32] PROBLEM - Host mw2448 is DOWN: PING CRITICAL - Packet loss = 100% [20:53:17] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:53:23] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T2100). [21:00:05] danisztls and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] o/ [21:01:00] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [21:01:05] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [21:01:18] Whoever's deploying, I have three patches -- they all need to be merged before I can test, but one is a submodule pullthrough of the other two so I might need to tweak it after the first two merge. [21:03:05] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [21:03:11] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [21:04:08] I can help w/ deployment today. [21:04:17] Kemayo: ready to go? [21:04:22] Go for it. [21:04:25] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [21:04:29] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [21:05:11] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [21:05:14] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [21:05:18] dancy: The "Update VE core submodule" patch may or may not merge after the other two go -- I'll need to make sure it wound up with the commit hash I'm expecting. 
:D [21:06:29] (03CR) 10Ahmon Dancy: [C: 03+2] Revert "Fix English Gboard backspace over aliens" [VisualEditor/VisualEditor] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983745 (https://phabricator.wikimedia.org/T353578) (owner: 10DLynch) [21:06:34] (03CR) 10Ahmon Dancy: [C: 03+2] Revert "Put zero-width space after inline focusable nodes" [VisualEditor/VisualEditor] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983906 (https://phabricator.wikimedia.org/T353578) (owner: 10DLynch) [21:07:14] danisztls: You around? [21:07:50] danisztls: Ready to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/983928/ ? [21:07:59] dancy: yes [21:08:19] OK. Doing that first since it should be a quickie [21:08:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983928 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:09:10] (03Merged) 10jenkins-bot: Undeploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983928 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:09:27] !log dancy@deploy2002 Started scap: Backport for [[gerrit:983928|Undeploy Reader Demographics 2 survey (T344393)]] [21:09:32] (03Merged) 10jenkins-bot: Revert "Fix English Gboard backspace over aliens" [VisualEditor/VisualEditor] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983745 (https://phabricator.wikimedia.org/T353578) (owner: 10DLynch) [21:09:48] (03Merged) 10jenkins-bot: Revert "Put zero-width space after inline focusable nodes" [VisualEditor/VisualEditor] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983906 (https://phabricator.wikimedia.org/T353578) (owner: 10DLynch) [21:09:50] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:10:47] !log dancy@deploy2002 dani and dancy: Backport for [[gerrit:983928|Undeploy Reader Demographics 2 survey (T344393)]] 
synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:11:21] dancy: looks good [21:11:32] thx [21:11:33] !log dancy@deploy2002 dani and dancy: Continuing with sync [21:12:10] dancy: thanks [21:12:21] Kemayo: Looks like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/983937 will need fixing up. [21:12:36] dancy: Yeah, give me a minute to work out what I need to change. [21:13:08] (03CR) 10Pols12: Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [21:16:54] (03PS2) 10DLynch: Update VE core submodule to wmf.9 (6bada65) [extensions/VisualEditor] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983911 (https://phabricator.wikimedia.org/T353578) [21:17:58] !log dancy@deploy2002 Finished scap: Backport for [[gerrit:983928|Undeploy Reader Demographics 2 survey (T344393)]] (duration: 08m 30s) [21:18:22] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:18:58] (03CR) 10DLynch: "recheck" [extensions/VisualEditor] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983911 (https://phabricator.wikimedia.org/T353578) (owner: 10DLynch) [21:19:10] dancy: made a different patch for it, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/983911 -- just waiting to see that it actually merges now. [21:22:16] dancy: Well, it doesn't immediately fail for being unmergeable. Just waiting on jenkins getting around to actually running tests. [21:35:45] dancy: Okay, tests are passed, you can merge 983911.
[21:36:36] (CR) Ahmon Dancy: [C: +2] Update VE core submodule to wmf.9 (6bada65) [extensions/VisualEditor] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/983911 (https://phabricator.wikimedia.org/T353578) (owner: DLynch)
[21:38:37] (CR) TrainBranchBot: [C: +2] "Approved by dancy@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/983911 (https://phabricator.wikimedia.org/T353578) (owner: DLynch)
[21:53:48] (Merged) jenkins-bot: Update VE core submodule to wmf.9 (6bada65) [extensions/VisualEditor] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/983911 (https://phabricator.wikimedia.org/T353578) (owner: DLynch)
[21:54:05] !log dancy@deploy2002 Started scap: Backport for [[gerrit:983745|Revert "Fix English Gboard backspace over aliens" (T353578 T325129)]], [[gerrit:983906|Revert "Put zero-width space after inline focusable nodes" (T353578 T330284)]], [[gerrit:983911|Update VE core submodule to wmf.9 (6bada65) (T353578 T330284 T325129)]]
[21:54:13] T353578: Chinese input: Some words disappear - https://phabricator.wikimedia.org/T353578
[21:54:13] T325129: On-screen keyboard disappears when cursor encounters non-breaking space in Chrome; cannot delete nbsp in Firefox - https://phabricator.wikimedia.org/T325129
[21:54:14] T330284: English Gboard causes corruption when backspacing a focusable node that ends with a Latin letter - https://phabricator.wikimedia.org/T330284
[21:56:07] !log dancy@deploy2002 dancy and kemayo: Backport for [[gerrit:983745|Revert "Fix English Gboard backspace over aliens" (T353578 T325129)]], [[gerrit:983906|Revert "Put zero-width space after inline focusable nodes" (T353578 T330284)]], [[gerrit:983911|Update VE core submodule to wmf.9 (6bada65) (T353578 T330284 T325129)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:57:19] dancy: Nothing's obviously wrong testing on 2002.
[21:57:27] OK.. Proceeding.
[21:57:44] !log dancy@deploy2002 dancy and kemayo: Continuing with sync
[22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231218T2200).
[22:04:30] (PS1) Xcollazo: Bump refine_sanitize refinery version to pickup fix for T349121 [puppet] - https://gerrit.wikimedia.org/r/983946 (https://phabricator.wikimedia.org/T349121)
[22:05:39] (CR) Xcollazo: "CC Ben and Marcel." [puppet] - https://gerrit.wikimedia.org/r/983946 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[22:07:40] !log dancy@deploy2002 Finished scap: Backport for [[gerrit:983745|Revert "Fix English Gboard backspace over aliens" (T353578 T325129)]], [[gerrit:983906|Revert "Put zero-width space after inline focusable nodes" (T353578 T330284)]], [[gerrit:983911|Update VE core submodule to wmf.9 (6bada65) (T353578 T330284 T325129)]] (duration: 13m 34s)
[22:07:47] T353578: Chinese input: Some words disappear - https://phabricator.wikimedia.org/T353578
[22:07:48] T325129: On-screen keyboard disappears when cursor encounters non-breaking space in Chrome; cannot delete nbsp in Firefox - https://phabricator.wikimedia.org/T325129
[22:07:48] T330284: English Gboard causes corruption when backspacing a focusable node that ends with a Latin letter - https://phabricator.wikimedia.org/T330284
[22:08:26] !log UTC late backport window completed.
[22:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:19] (CR) Mforns: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/983946 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[22:28:14] (PS1) Bking: wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672)
[22:30:28] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[22:31:00] (CR) CI reject: [V: -1] wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[22:31:51] (PS1) Bking: wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983948 (https://phabricator.wikimedia.org/T353672)
[22:32:44] (Abandoned) Bking: wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[22:33:11] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983948 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[22:35:24] !log Deployed patch for T347704
[22:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:43] one host failed to restart php-fpm
[22:36:02] I'll file a bug as well so this is tracked somewhere
[22:37:57] going to try scap one more time and then if that fails, create a bug
[22:39:12] (PS1) Bking: prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983949 (https://phabricator.wikimedia.org/T353672)
[22:39:42] (CR) CI reject: [V: -1] prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983949 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[22:43:27] (CR) Pols12: Make wiktionary and mw.org provide og:site_name (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: Pols12)
[22:45:25] (PS2) Bking: prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983949 (https://phabricator.wikimedia.org/T353672)
[22:45:53] (CR) CI reject: [V: -1] prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983949 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[22:46:45] (PS2) Bking: prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983948 (https://phabricator.wikimedia.org/T353672)
[22:49:21] second scap still had the same issue with the bug, making a phab ticket
[22:50:01] (Abandoned) Bking: prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983949 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[22:50:29] (Restored) Bking: wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[22:51:42] (PS2) Bking: wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672)
[22:53:08] (PS3) Bking: prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672)
[22:55:35] (PS4) Bking: wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672)
[22:57:31] (PS5) Bking: prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672)
[23:01:15] (PS6) Bking: wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672)
[23:04:08] (PS7) Bking: prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672)
[23:05:23] (PS8) Bking: wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672)
[23:14:01] (CR) Herron: verlib2: initial packaging (2 comments) [debs/python-verlib2] - https://gerrit.wikimedia.org/r/983468 (owner: Herron)
[23:18:49] SRE, Release-Engineering-Team, Scap: Php-fpm failed to start on one host during Scap - https://phabricator.wikimedia.org/T353679 (Mstyles)
[23:19:23] (PS1) Dwisehaupt: Add dyna record for community-crm [dns] - https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T302995)
[23:22:35] (PS1) Dwisehaupt: Set the cdn to pass requests for community-crm [puppet] - https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T302995)
[23:23:43] filed https://phabricator.wikimedia.org/T353679 for the scap error
[23:25:46] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (ANakanishi_WMF) Hi @jhathaway, I've linked my phabricator and LDAP accounts. I've also gone ahead and updated my...
[23:30:37] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (LMixter) Approved. :)
[23:30:39] SRE, Release-Engineering-Team, Scap: Php-fpm failed to start on one host during Scap (ssh to mw2448.codfw.wmnet timed out) - https://phabricator.wikimedia.org/T353679 (bd808)
[23:31:19] anybody have time to poke the console of host mw2448.codfw.wmnet to see what's up? ^
[23:31:46] SRE, Release-Engineering-Team, Scap: Php-fpm failed to start on one host during Scap (ssh to mw2448.codfw.wmnet timed out) - https://phabricator.wikimedia.org/T353679 (dancy) Thanks for filing this! There have been several reports of mw2448 going offline this year: T345597, T334429 + T334420 (probab...
[23:36:19] * taavi looking
[23:38:32] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:39:18] SRE, Release-Engineering-Team, Scap: Php-fpm failed to start on one host during Scap (ssh to mw2448.codfw.wmnet timed out) - https://phabricator.wikimedia.org/T353679 (taavi) ` ------------------------------------------------------------------------------- Record: 3 Date/Time: 12/18/2023 19:48...
[23:40:07] do conftool changes not log things to SAL anymore?
[23:40:58] !log conftool codfw/appserver/nginx/mw2448.codfw.wmnet: pooled changed yes => inactive # T353679, not sure why it was not logged automatically
[23:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:03] T353679: Php-fpm failed to start on one host during Scap (ssh to mw2448.codfw.wmnet timed out) - https://phabricator.wikimedia.org/T353679
[23:43:44] SRE, ops-codfw, serviceops: Php-fpm failed to start on one host during Scap (ssh to mw2448.codfw.wmnet timed out) - https://phabricator.wikimedia.org/T353679 (taavi) I've set this host as `pooled=inactive` in conftool. Tagging #serviceops since this is their host and #ops-codfw directly given the his...
[23:45:40] SRE, ops-codfw, serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (taavi)