[00:16:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:16:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:17:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:17:43] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.379 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:20:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:27:52] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host stewards2001.codfw.wmnet [00:27:53] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [00:30:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:39] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM stewards2001.codfw.wmnet - dzahn@cumin1001" [00:30:59] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Dzahn) ` dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 2 --disk 20 --cluster codfw -t T344164 --gr... 
[00:31:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM stewards2001.codfw.wmnet - dzahn@cumin1001" [00:31:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:31:29] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache stewards2001.codfw.wmnet on all recursors [00:31:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) stewards2001.codfw.wmnet on all recursors [00:31:58] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM stewards2001.codfw.wmnet - dzahn@cumin1001" [00:32:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM stewards2001.codfw.wmnet - dzahn@cumin1001" [00:32:48] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=96) for new host stewards2001.codfw.wmnet [00:35:28] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Dzahn) ` Exception raised while parsing arguments for cookbook sre.hosts.reimage: Traceback (most recent call last): Fil... 
[00:35:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/971435 [00:39:05] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/971435 (owner: 10TrainBranchBot) [00:45:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:03] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:21] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:53] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:18] (03PS1) 10Dzahn: site/netboot: add special VMs for stewards [puppet] - 10https://gerrit.wikimedia.org/r/972070 (https://phabricator.wikimedia.org/T344164) [00:52:51] (03CR) 10CI reject: [V: 04-1] site/netboot: add special VMs for stewards [puppet] - 10https://gerrit.wikimedia.org/r/972070 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [00:53:05] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:29] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:13] (03PS2) 10Dzahn: site/netboot: add special VMs for stewards [puppet] - 10https://gerrit.wikimedia.org/r/972070 (https://phabricator.wikimedia.org/T344164) [00:58:25] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/971435 (owner: 10TrainBranchBot) [00:59:53] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:15] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:49] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T350643 (10phaultfinder) [01:14:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:29:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:09:07] (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:14:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:38:11] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:54:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:00:06] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T0300) [03:07:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.4 [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/971436 (https://phabricator.wikimedia.org/T350080) [03:07:35] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.4 [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/971436 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [03:08:12] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:31] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:09:37] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 867.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:25:26] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.4 [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/971436 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [03:28:27] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [03:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T0400) [04:01:26] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972081 (https://phabricator.wikimedia.org/T350080) [04:01:28] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972081 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [04:02:11] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972081 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [04:02:36] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.4 refs T350080 [04:02:40] T350080: 1.42.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T350080 [04:22:45] PROBLEM - Improperly owned -0:0- 
files in /srv/mediawiki-staging on deploy1002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [04:33:09] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [04:35:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:53:40] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.4 refs T350080 (duration: 51m 04s) [04:53:46] T350080: 1.42.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T350080 [04:55:55] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.2 (duration: 02m 12s) [05:00:29] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-11-06-060744-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/971633 (https://phabricator.wikimedia.org/T333969) (owner: 10KartikMistry) [05:01:18] (03Merged) 10jenkins-bot: Update cxserver to 2023-11-06-060744-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/971633 (https://phabricator.wikimedia.org/T333969) (owner: 10KartikMistry) [05:07:17] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:07:39] Updating cxserver ^^ [05:07:49] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:20:23] (03PS1) 10KartikMistry: cxserver: Bump chart to 0.2.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972084 [05:23:13] (03CR) 10KartikMistry: [C: 03+2] cxserver: Bump chart to 0.2.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972084 (owner: 
10KartikMistry) [05:24:17] (03Merged) 10jenkins-bot: cxserver: Bump chart to 0.2.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972084 (owner: 10KartikMistry) [05:32:08] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:32:22] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:44:06] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:44:40] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:45:19] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:46:28] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:48:02] !log Updated cxserver to 2023-11-06-060744-production (T333969, T350229, T350241, T350373) [05:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:10] T350229: Post-creation work for dgawiki - https://phabricator.wikimedia.org/T350229 [05:48:11] T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969 [05:48:11] T350241: Post-creation work for zghwiki - https://phabricator.wikimedia.org/T350241 [05:48:11] T350373: Post-creation work for bbcwiki - https://phabricator.wikimedia.org/T350373 [05:50:27] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:11:46] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:16:14] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 35 hosts with reason: Primary switchover s1 T350142 [06:16:18] T350142: Switchover s1 master (db2112 -> db2103) 
- https://phabricator.wikimedia.org/T350142 [06:16:41] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 35 hosts with reason: Primary switchover s1 T350142 [06:18:12] (JobUnavailable) firing: (3) Reduced availability for job etherpad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:23:12] (JobUnavailable) firing: (3) Reduced availability for job etherpad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:26:46] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:53:17] (03PS2) 10VolkerE: Replace WikimediaUI Base with Codex design tokens [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971604 (https://phabricator.wikimedia.org/T331403) [06:54:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:59:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T0700) [07:00:05] kormat, marostegui, and Amir1: (Dis)respected human, time 
to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T0700). Please do the needful. [07:00:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:04:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:09:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:27:42] (03CR) 10Jelto: [C: 03+2] Halve profile::gitlab::runner::buildkitd_gckeepstorage [puppet] - 10https://gerrit.wikimedia.org/r/971502 (https://phabricator.wikimedia.org/T350478) (owner: 10Ahmon Dancy) [07:34:39] PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:37:27] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:38:33] PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:38:39] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:07] (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to 
expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:40:07] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:40:37] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:41:21] RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:27] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:42:53] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:44:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:48:57] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:07] (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:51:45] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:52:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:00:05] Amir1, Urbanecm, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:00:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:02:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:02:19] (03CR) 10Muehlenhoff: [C: 03+1] site/netboot: add special VMs for stewards [puppet] - 10https://gerrit.wikimedia.org/r/972070 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [08:02:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.348 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:03:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:04:01] 10SRE: Dependencies from backports in wmf-debci - https://phabricator.wikimedia.org/T350658 (10fgiunchedi) [08:07:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:11:41] I'm going to silence the beta wikifunctions alert, I don't think anyone is looking at it [08:19:27] (03CR) 10Muehlenhoff: [C: 03+2] Update PHP hook to use Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/971921 (owner: 10Muehlenhoff) [08:19:46] (03PS1) 10Volans: sre.ganeti.makevm: fix parameter passed to reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/972318 (https://phabricator.wikimedia.org/T344164) [08:22:51] 
10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, and 2 others: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Volans) >>! In T344164#9311058, @Dzahn wrote: > ` > Exception raised while parsing arguments for cookbook sre.hosts.reimage:... [08:24:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:31:26] 10SRE: Dependencies from backports in wmf-debci - https://phabricator.wikimedia.org/T350658 (10MatthewVernon) Mmm. AFAICT (I need to confirm properly with test case for a bug report against `apt`) there is no priority setting that achieves "use -backports packages where necessary to provide a versioned dependen... [08:34:14] (03CR) 10Vgutierrez: [C: 03+1] haproxy: enable healthcheck-dedicated backend in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/971966 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [08:34:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:35:54] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-codfw and A:cp [08:36:46] godog: isn't the beta cluster part of our duties? 
:) [08:37:10] well volunteered ;p [08:37:28] (03CR) 10Ayounsi: Split interface_automation into multiple files (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 (owner: 10Ayounsi) [08:37:47] vgutierrez: no :) [08:37:55] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: enable healthcheck-dedicated backend in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/971966 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [08:38:20] (03CR) 10Ayounsi: [C: 03+2] Split interface_automation into multiple files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 (owner: 10Ayounsi) [08:39:03] (03Merged) 10jenkins-bot: Split interface_automation into multiple files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 (owner: 10Ayounsi) [08:39:36] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:40:28] (03CR) 10Ayounsi: [C: 03+2] sre.hosts.reimage: use the new ImportPuppetDB path [cookbooks] - 10https://gerrit.wikimedia.org/r/971959 (owner: 10Ayounsi) [08:41:05] puppet seems to be broken on deployment-cache-text08 (The last Puppet run was at Mon Oct 2 14:21:32 UTC 2023 (51499 minutes ago).) [08:42:16] (03PS1) 10Fabfur: haproxy: enable healthcheck-dedicated backend in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/972320 (https://phabricator.wikimedia.org/T348851) [08:43:07] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, and 2 others: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Volans) @Dzahn once the above patch is merged you can proceed directly running the reimage cookbook on the host as the VM was... 
[08:44:27] (03Merged) 10jenkins-bot: sre.hosts.reimage: use the new ImportPuppetDB path [cookbooks] - 10https://gerrit.wikimedia.org/r/971959 (owner: 10Ayounsi) [08:44:40] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [08:44:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [08:45:44] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/972320 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [08:49:36] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:50:37] (03Abandoned) 10DCausse: rdf-streaming-updater: simplify parallelism configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/961024 (https://phabricator.wikimedia.org/T346456) (owner: 10DCausse) [08:54:51] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:58:02] (03CR) 10Ayounsi: [C: 03+2] Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 (owner: 10Ayounsi) [08:58:16] (03CR) 10Ayounsi: [C: 03+2] provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 (owner: 10Ayounsi) [08:58:23] (03CR) 10Ayounsi: [C: 03+2] provision_server: don't show 
servers with a primary IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969752 (owner: 10Ayounsi) [08:58:38] (03CR) 10Ayounsi: [C: 03+2] Add MoveServersUplinks Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: 10Ayounsi) [08:58:47] (03CR) 10Ayounsi: [C: 03+2] Remove "Import Interfaces from a JSON blob" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/971958 (owner: 10Ayounsi) [08:58:52] (03Merged) 10jenkins-bot: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 (owner: 10Ayounsi) [08:59:05] (03Merged) 10jenkins-bot: provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 (owner: 10Ayounsi) [08:59:09] (03Merged) 10jenkins-bot: provision_server: don't show servers with a primary IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969752 (owner: 10Ayounsi) [08:59:12] (03Merged) 10jenkins-bot: Add MoveServersUplinks Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: 10Ayounsi) [08:59:23] (03Merged) 10jenkins-bot: Remove "Import Interfaces from a JSON blob" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/971958 (owner: 10Ayounsi) [08:59:36] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:00:05] jnuche and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T0900). 
[09:00:27] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [09:00:37] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [09:01:08] (03PS1) 10KartikMistry: Update cxserver to 2023-11-07-081511-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/972323 (https://phabricator.wikimedia.org/T349118) [09:01:33] morning, I'll deploy the train in a few minutes [09:07:16] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [09:07:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [09:07:23] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972328 (https://phabricator.wikimedia.org/T350080) [09:07:25] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972328 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [09:08:09] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972328 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [09:09:17] (03CR) 10DCausse: rdf-streaming-updater: update values for application mode (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [09:12:17] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [09:13:02] (03PS1) 10Arnaudb: debug: printing results when return object count > 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656) [09:13:11] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi) [09:13:22] !log vgutierrez@cumin1001 END (PASS) - 
Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-codfw and A:cp [09:13:30] 10SRE, 10Infrastructure-Foundations, 10netops: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10ayounsi) 05Open→03Resolved a:05cmooney→03ayounsi https://netbox.wikimedia.org/extras/scripts/move_server.MoveServersUplinks/ is live! [09:14:46] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.4 refs T350080 [09:15:03] T350080: 1.42.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T350080 [09:16:00] (03CR) 10CI reject: [V: 04-1] debug: printing results when return object count > 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656) (owner: 10Arnaudb) [09:18:01] tgr: hi, train is in group0, in case you want to check T347223 [09:18:02] T347223: Exception: Key contains invalid characters: centralauth:central-login-complete-token:1�À§À¢%2527%2522 - https://phabricator.wikimedia.org/T347223 [09:22:35] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [09:23:20] (03PS2) 10Arnaudb: debug: printing results when return object count > 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656) [09:23:46] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [09:23:52] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [09:26:29] (03CR) 10CI reject: [V: 04-1] debug: printing results when return object count > 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656) (owner: 10Arnaudb) [09:27:54] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-drmrs and A:cp [09:28:40] (03PS1) 10Ayounsi: Remove include for old esams-drmrs 
link [dns] - 10https://gerrit.wikimedia.org/r/972332 (https://phabricator.wikimedia.org/T347892) [09:28:42] (03CR) 10Vgutierrez: [C: 03+1] haproxy: enable healthcheck-dedicated backend in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/972320 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [09:29:46] (03CR) 10CI reject: [V: 04-1] Remove include for old esams-drmrs link [dns] - 10https://gerrit.wikimedia.org/r/972332 (https://phabricator.wikimedia.org/T347892) (owner: 10Ayounsi) [09:31:03] (03CR) 10Cathal Mooney: [C: 03+1] Remove include for old esams-drmrs link [dns] - 10https://gerrit.wikimedia.org/r/972332 (https://phabricator.wikimedia.org/T347892) (owner: 10Ayounsi) [09:32:51] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:34:36] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: enable healthcheck-dedicated backend in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/972320 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [09:34:39] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove drmrs-esams IPs - ayounsi@cumin1001" [09:34:45] (03CR) 10Ayounsi: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/972332 (https://phabricator.wikimedia.org/T347892) (owner: 10Ayounsi) [09:34:55] !log restarting blazegraph on wdqs1007 (stuck for 10+hours) [09:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:55] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:35:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove drmrs-esams IPs - ayounsi@cumin1001" [09:35:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:36:29] (03CR) 10Ayounsi: [C: 03+2] 
Remove include for old esams-drmrs link [dns] - 10https://gerrit.wikimedia.org/r/972332 (https://phabricator.wikimedia.org/T347892) (owner: 10Ayounsi) [09:36:35] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:41:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:42:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/972061 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [09:42:43] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:42:48] (03PS1) 10Filippo Giunchedi: alertmanager: add alerts-triage on /triage [puppet] - 10https://gerrit.wikimedia.org/r/972335 (https://phabricator.wikimedia.org/T350014) [09:43:03] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:44:22] (03PS1) 10Fabfur: haproxy: enable healthcheck-dedicated backend in esams [puppet] - 10https://gerrit.wikimedia.org/r/972336 (https://phabricator.wikimedia.org/T348851) [09:44:27] (03CR) 10JMeybohm: [C: 03+2] Update eventstreams to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967402 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:44:33] (03PS1) 10Filippo Giunchedi: pontoon: deal with empty hosts in public_lb [puppet] - 
10https://gerrit.wikimedia.org/r/972337 [09:44:58] (03CR) 10CI reject: [V: 04-1] alertmanager: add alerts-triage on /triage [puppet] - 10https://gerrit.wikimedia.org/r/972335 (https://phabricator.wikimedia.org/T350014) (owner: 10Filippo Giunchedi) [09:45:21] (03Merged) 10jenkins-bot: Update eventstreams to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967402 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:46:15] (03PS2) 10Filippo Giunchedi: alertmanager: add alerts-triage on /triage [puppet] - 10https://gerrit.wikimedia.org/r/972335 (https://phabricator.wikimedia.org/T350014) [09:46:53] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: deal with empty hosts in public_lb [puppet] - 10https://gerrit.wikimedia.org/r/972337 (owner: 10Filippo Giunchedi) [09:48:55] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/972336 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [09:49:40] 10SRE, 10Patch-For-Review: Dependencies from backports in wmf-debci - https://phabricator.wikimedia.org/T350658 (10CodeReviewBot) mvernon opened https://gitlab.wikimedia.org/repos/sre/wmf-debci/-/merge_requests/2 If USEBACKPORTS set, tell apt to use the relevant -backports suite [09:49:46] (03CR) 10Fabfur: haproxy: enable healthcheck-dedicated backend in esams [puppet] - 10https://gerrit.wikimedia.org/r/972336 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [09:50:43] 10SRE, 10Patch-For-Review: Dependencies from backports in wmf-debci - https://phabricator.wikimedia.org/T350658 (10MatthewVernon) (I thought `USEBACKPORTS` was better, but I'm not wedded to that) [09:52:35] (03PS1) 10Giuseppe Lavagetto: modules: add base.statsd:1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972339 [09:52:37] (03PS1) 10Giuseppe Lavagetto: base.statsd: add prestop sleep 
helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/972340 [09:52:39] (03PS1) 10Giuseppe Lavagetto: mediawiki: update statsd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/972341 [09:52:41] (03PS1) 10Giuseppe Lavagetto: mw-debug: add statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972342 (https://phabricator.wikimedia.org/T240685) [09:52:43] (03PS1) 10Giuseppe Lavagetto: mediawiki: add statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972343 (https://phabricator.wikimedia.org/T240685) [09:53:11] !log installing nss security updates [09:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:13] (03CR) 10Santhosh: [C: 03+1] Update cxserver to 2023-11-07-081511-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/972323 (https://phabricator.wikimedia.org/T349118) (owner: 10KartikMistry) [09:53:28] (03CR) 10CI reject: [V: 04-1] base.statsd: add prestop sleep helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/972340 (owner: 10Giuseppe Lavagetto) [09:53:33] (03CR) 10CI reject: [V: 04-1] mediawiki: update statsd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/972341 (owner: 10Giuseppe Lavagetto) [09:53:35] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 10 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/972336 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [09:53:37] (03CR) 10CI reject: [V: 04-1] mw-debug: add statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972342 (https://phabricator.wikimedia.org/T240685) (owner: 10Giuseppe Lavagetto) [09:53:40] (03CR) 10CI reject: [V: 04-1] mediawiki: add statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972343 (https://phabricator.wikimedia.org/T240685) (owner: 10Giuseppe Lavagetto) [09:54:19] !log jayme@deploy2002 helmfile 
[staging] START helmfile.d/services/eventstreams-internal: apply [09:55:11] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [09:58:54] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [09:59:32] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [10:00:27] 10SRE, 10ops-esams, 10Documentation: Update on-wiki documentation about esams - https://phabricator.wikimedia.org/T344129 (10ayounsi) 05Resolved→03Open There are some outdated pages, for example: * https://wikitech.wikimedia.org/wiki/Esams_data_center * https://wikitech.wikimedia.org/wiki/Knams_data_cent... [10:00:32] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ayounsi) [10:01:42] (03CR) 10Vgutierrez: [C: 03+1] haproxy: enable healthcheck-dedicated backend in esams [puppet] - 10https://gerrit.wikimedia.org/r/972336 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [10:02:29] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [10:03:22] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [10:03:40] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [10:03:42] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [10:04:54] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [10:09:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [10:09:06] (03PS2) 10Giuseppe Lavagetto: base.statsd: add prestop sleep helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/972340 [10:09:08] (03PS2) 10Giuseppe Lavagetto: mediawiki: update statsd module [deployment-charts] - 
10https://gerrit.wikimedia.org/r/972341 [10:09:10] (03PS2) 10Giuseppe Lavagetto: mw-debug: add statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972342 (https://phabricator.wikimedia.org/T240685) [10:09:12] (03PS2) 10Giuseppe Lavagetto: mediawiki: add statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972343 (https://phabricator.wikimedia.org/T240685) [10:10:52] !log installing dbus security updates on bookworm [10:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:17] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [10:11:27] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_test_cluster::coordinator [10:11:28] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [10:12:13] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [10:15:01] 10SRE, 10collaboration-services, 10Patch-For-Review: Dependencies from backports in wmf-debci - https://phabricator.wikimedia.org/T350658 (10CodeReviewBot) mvernon merged https://gitlab.wikimedia.org/repos/sre/wmf-debci/-/merge_requests/2 If USEBACKPORTS set, tell apt to use the relevant -backports suite [10:15:09] 10SRE, 10collaboration-services, 10Patch-For-Review: Dependencies from backports in wmf-debci - https://phabricator.wikimedia.org/T350658 (10LSobanski) a:03MatthewVernon [10:16:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [10:16:35] (03PS1) 10Muehlenhoff: Move analytics_test_cluster::coordinator to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972344 (https://phabricator.wikimedia.org/T349619) [10:17:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-drmrs and A:cp [10:21:03] (03CR) 10Muehlenhoff: [C: 03+2] Move 
analytics_test_cluster::coordinator to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972344 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:24:41] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:30:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_test_cluster::coordinator [10:32:08] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: enable healthcheck-dedicated backend in esams [puppet] - 10https://gerrit.wikimedia.org/r/972336 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [10:33:15] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [10:35:53] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_test_cluster::hadoop::master [10:36:39] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [10:37:01] RECOVERY - Check systemd state on mw2400 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:20] (03PS1) 10Muehlenhoff: Switch analytics_test_cluster::hadoop::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972345 (https://phabricator.wikimedia.org/T349619) [10:45:30] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_test_cluster::hadoop::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972345 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:45:51] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:48:11] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 
1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:48:23] (03PS2) 10Hnowlan: wikifeeds: add rest-gateway config and bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971407 (https://phabricator.wikimedia.org/T349517) [10:51:30] (03CR) 10JMeybohm: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [10:53:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_test_cluster::hadoop::master [10:54:23] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_test_cluster::hadoop::standby [10:55:15] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:55:55] (03PS1) 10Muehlenhoff: Switch analytics_test_cluster::hadoop::standby to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972347 (https://phabricator.wikimedia.org/T349619) [10:58:21] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_test_cluster::hadoop::standby to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972347 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:58:34] (03CR) 10JMeybohm: [C: 03+2] Update wikifeeds to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967414 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:59:03] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [10:59:24] (03Merged) 10jenkins-bot: Update wikifeeds to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967414 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1100) [11:02:07] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [11:02:22] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [11:03:06] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[1075-1090].eqiad.wmnet} and A:cp [11:03:51] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [11:04:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_test_cluster::hadoop::standby [11:04:40] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_test_cluster::hadoop::worker [11:06:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:07:46] (03PS1) 10Giuseppe Lavagetto: mediawiki: clarify comment on swift egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/972349 [11:08:04] (03PS1) 10Muehlenhoff: Switch analytics_test_cluster::hadoop::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972350 (https://phabricator.wikimedia.org/T349619) [11:09:36] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_test_cluster::hadoop::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972350 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:09:53] (03CR) 10Hnowlan: [C: 03+1] mediawiki: clarify comment on swift egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/972349 (owner: 10Giuseppe Lavagetto) [11:11:08] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to 
puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:11:13] (03PS7) 10Effie Mouzeli: ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [11:11:55] (03CR) 10CI reject: [V: 04-1] ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [11:13:44] (03PS8) 10Effie Mouzeli: ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [11:13:46] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [11:14:36] (03CR) 10JMeybohm: [C: 03+2] Update zotero to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967415 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:15:07] (03CR) 10Effie Mouzeli: ipoid: add cronjobs for initialImport and dailyUpdate (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [11:15:12] 10SRE, 10Traffic: HAProxy should use a single backend for Vanish - https://phabricator.wikimedia.org/T349287 (10Fabfur) 05Open→03Resolved This change has been deployed on all DCs along with the changes for T348851 [11:15:36] (03Merged) 10jenkins-bot: Update zotero to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967415 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:15:45] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:16:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_test_cluster::hadoop::worker [11:17:28] (03CR) 10Giuseppe 
Lavagetto: [C: 03+2] mediawiki: clarify comment on swift egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/972349 (owner: 10Giuseppe Lavagetto) [11:18:28] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [11:18:36] (03Merged) 10jenkins-bot: mediawiki: clarify comment on swift egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/972349 (owner: 10Giuseppe Lavagetto) [11:18:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_test_cluster::presto::server [11:18:48] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:20:19] (03PS1) 10Muehlenhoff: Switch analytics_test_cluster::presto::server to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972352 (https://phabricator.wikimedia.org/T349619) [11:20:41] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:21:06] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:22:21] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_test_cluster::presto::server to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972352 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:22:37] (03CR) 10Phuedx: [C: 03+1] wikifeeds: add rest-gateway config and bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971407 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [11:23:31] (03CR) 10Jbond: [C: 03+1] "lgtm baring comments on the comment :)" [puppet] - 10https://gerrit.wikimedia.org/r/972061 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [11:24:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/972318 (https://phabricator.wikimedia.org/T344164) (owner: 10Volans) [11:26:26] 10SRE, 10collaboration-services, 10Patch-For-Review: Dependencies from backports in wmf-debci - https://phabricator.wikimedia.org/T350658 (10CodeReviewBot) mvernon opened 
https://gitlab.wikimedia.org/repos/sre/wmf-debci/-/merge_requests/3 Explicitly use echo -e to force backslash-interpretation [11:26:41] (03CR) 10Volans: [C: 03+2] sre.ganeti.makevm: fix parameter passed to reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/972318 (https://phabricator.wikimedia.org/T344164) (owner: 10Volans) [11:26:48] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:27:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_test_cluster::presto::server [11:27:15] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:27:38] (03CR) 10Btullis: [C: 03+2] "This looks good to me. Thanks very much. I'll try to deploy it today." [deployment-charts] - 10https://gerrit.wikimedia.org/r/969345 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:28:29] (03Merged) 10jenkins-bot: Update datahub to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969345 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:30:37] (03Merged) 10jenkins-bot: sre.ganeti.makevm: fix parameter passed to reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/972318 (https://phabricator.wikimedia.org/T344164) (owner: 10Volans) [11:32:30] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:32:51] !log reset PIC in cr1-eqiad slot 1/1 to enable port et-1/1/2 at 100G for new transport (T350504) [11:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:03] T350504: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 [11:33:47] !log btullis@cumin1001 Added views for new wiki: zghwiki T350240 [11:33:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:33:51] T350240: Prepare and check storage layer for zghwiki - 
https://phabricator.wikimedia.org/T350240 [11:34:15] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:35:24] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_test_cluster::client [11:36:39] (03PS1) 10Muehlenhoff: Switch analytics_test_cluster::client to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972353 (https://phabricator.wikimedia.org/T349619) [11:37:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (10cmooney) I added the config to set the port to 100G and bounced the PIC (the other, asw facing, ports on it were already VRRP backup). Light levels inbound look g... [11:37:38] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:38:13] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_test_cluster::client to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972353 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:42:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[1075-1090].eqiad.wmnet} and A:cp [11:43:36] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [11:43:51] (03PS1) 10Fabfur: haproxy: re-set varnish maxconn for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) [11:44:27] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [11:45:20] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [11:45:26] !log btullis@cumin1001 Added views for new wiki: bbcwiki T350372 [11:45:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:45:31] T350372: Prepare and check storage layer for bbcwiki - https://phabricator.wikimedia.org/T350372 [11:46:03] !log jayme@deploy2002 
helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [11:46:35] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [11:46:44] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [11:47:16] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [11:47:17] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [11:47:23] !log btullis@cumin1001 Added views for new wiki: bjnwikiquote T350234 [11:47:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:47:27] T350234: Prepare and check storage layer for bjnwikiquote - https://phabricator.wikimedia.org/T350234 [11:48:00] (03PS6) 10Jforrester: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054) [11:48:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_test_cluster::client [11:48:43] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [11:48:49] !log btullis@cumin1001 Added views for new wiki: dgawiki T350228 [11:48:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:48:52] T350228: Prepare and check storage layer for dgawiki - https://phabricator.wikimedia.org/T350228 [11:49:32] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 10 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur) [11:50:44] (03CR) 10Hnowlan: [C: 03+2] wikifeeds: add rest-gateway config and bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971407 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [11:51:47] (03Merged) 10jenkins-bot: wikifeeds: add rest-gateway config and bump 
image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971407 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [11:52:50] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [11:53:09] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [11:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:53:31] hnowlan: you will see quite a diff for wikifeeds in prod [11:53:49] https://phabricator.wikimedia.org/T300033#9312006 [11:56:32] jayme: oof, good to know [11:56:59] hnowlan: all the envoy stuff "should be fine" ... but I was afraid to deploy the config change to prod [11:57:30] 10SRE, 10collaboration-services, 10Patch-For-Review: Dependencies from backports in wmf-debci - https://phabricator.wikimedia.org/T350658 (10CodeReviewBot) mvernon merged https://gitlab.wikimedia.org/repos/sre/wmf-debci/-/merge_requests/3 Explicitly use echo -e to force backslash-interpretation [11:57:47] jayme: yeeeeah dunno how I feel about that either. 
yiannis is pto [11:58:05] hmpf [11:58:20] (03PS1) 10Muehlenhoff: puppet.conf: Double number_of_facts_soft_limit [puppet] - 10https://gerrit.wikimedia.org/r/972357 [11:58:51] (03CR) 10CI reject: [V: 04-1] puppet.conf: Double number_of_facts_soft_limit [puppet] - 10https://gerrit.wikimedia.org/r/972357 (owner: 10Muehlenhoff) [11:59:48] (03PS4) 10Jbond: mariadb - wikireplicas: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) [11:59:50] (03PS2) 10Jbond: mariadb - wmcs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) [11:59:52] (03PS2) 10Jbond: mariadb - analytics: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) [11:59:54] (03PS2) 10Jbond: mariadb - misc: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) [11:59:56] (03PS2) 10Jbond: mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) [11:59:58] (03PS2) 10Jbond: mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) [12:00:07] hnowlan: it looks like it's disabled in prod...but it's still a config change so I was hoping for a quick clarification from nemo-yiannis :/ [12:00:39] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:01:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/332/con" [puppet] - 10https://gerrit.wikimedia.org/r/968669 
(https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:02:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/334/console" [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:02:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/333/con" [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:03:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/335/con" [puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:04:55] jayme: yeah I think my change can wait until he's back [12:05:11] hnowlan: you know how long he's out? [12:05:34] !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [12:07:10] jayme: oof, 2 weeks [12:07:48] hmm...that's quite some time. Is there someone else we can ask? 
[12:08:05] (03CR) 10Vgutierrez: [C: 04-1] "Ifd9b71d5c6fc8d9e9b2772c3df1b7d4a736d620d never got merged, let's go with 20k first please" [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur) [12:08:47] (03PS1) 10Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) [12:09:08] !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [12:09:14] (03PS2) 10Fabfur: haproxy: re-set varnish maxconn for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) [12:09:21] (03CR) 10Fabfur: haproxy: re-set varnish maxconn for ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur) [12:10:17] (03CR) 10Jbond: "large amounts of facts can cause issues for puppetdb so i think it is useful to have this warning at the current value to catch any issue" [puppet] - 10https://gerrit.wikimedia.org/r/972357 (owner: 10Muehlenhoff) [12:11:03] !log btullis@deploy2002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [12:11:28] (03CR) 10Jbond: puppet.conf: Double number_of_facts_soft_limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972357 (owner: 10Muehlenhoff) [12:12:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/339/con" [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:13:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/341/con" [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) 
[12:14:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/340/con" [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:14:34] !log btullis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [12:14:39] (03PS5) 10Jbond: mariadb - wikireplicas: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) [12:14:41] (03PS3) 10Jbond: mariadb - wmcs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) [12:14:43] (03PS3) 10Jbond: mariadb - analytics: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) [12:14:45] (03PS3) 10Jbond: mariadb - misc: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) [12:14:47] (03PS3) 10Jbond: mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) [12:14:49] (03PS3) 10Jbond: mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) [12:16:16] (03CR) 10Muehlenhoff: puppet.conf: Double number_of_facts_soft_limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972357 (owner: 10Muehlenhoff) [12:20:15] (03PS4) 10Jbond: mariadb - wmcs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) [12:20:17] (03PS4) 10Jbond: mariadb - analytics: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968666 
(https://phabricator.wikimedia.org/T340741) [12:20:19] (03PS4) 10Jbond: mariadb - misc: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) [12:20:21] (03PS4) 10Jbond: mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) [12:20:23] (03PS4) 10Jbond: mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) [12:21:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:23:10] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:24:02] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::ui::superset::staging [12:25:57] (03PS1) 10Muehlenhoff: Switch analytics_cluster::ui::superset::staging to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972361 (https://phabricator.wikimedia.org/T349619) [12:26:11] (03PS2) 10Muehlenhoff: Switch analytics_cluster::ui::superset::staging to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972361 (https://phabricator.wikimedia.org/T349619) [12:28:15] (03PS1) 10Hnowlan: wikifeeds: configure log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/972362 (https://phabricator.wikimedia.org/T349517) [12:28:20] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::ui::superset::staging to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972361 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:32:02] 10SRE, 10SRE-tools, 
10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:33:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::ui::superset::staging [12:36:29] (03PS3) 10ArielGlenn: use virtual db domain for CentralAuth and GlobalBlocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) [12:37:40] (03CR) 10ArielGlenn: use virtual db domain for CentralAuth and GlobalBlocking (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [12:42:53] (03PS6) 10Jbond: mariadb - wikireplicas: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) [12:42:55] (03PS5) 10Jbond: mariadb - wmcs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) [12:42:57] (03PS5) 10Jbond: mariadb - analytics: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) [12:42:59] (03PS5) 10Jbond: mariadb - misc: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) [12:43:01] (03PS5) 10Jbond: mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) [12:43:03] (03PS5) 10Jbond: mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) [12:43:05] (03PS1) 10Jbond: mariadb: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972363 (https://phabricator.wikimedia.org/T340741) [12:43:07] (03PS1) 
10Jbond: analytics_cluster/ui/superset: Update ssl ca [puppet] - 10https://gerrit.wikimedia.org/r/972364 (https://phabricator.wikimedia.org/T340741) [12:43:09] (03PS1) 10Jbond: airflow: update SSL certs to use combined CA [puppet] - 10https://gerrit.wikimedia.org/r/972365 (https://phabricator.wikimedia.org/T340741) [12:43:11] (03PS1) 10Jbond: dragonfly: update SSL certs to use combined CA [puppet] - 10https://gerrit.wikimedia.org/r/972366 (https://phabricator.wikimedia.org/T340741) [12:43:13] (03PS1) 10Jbond: orchestrator: update SSL certs to use combined CA [puppet] - 10https://gerrit.wikimedia.org/r/972367 (https://phabricator.wikimedia.org/T340741) [12:43:15] (03PS1) 10Jbond: prometheus: update ssl CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) [12:43:17] (03PS1) 10Jbond: kafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 [12:43:19] (03PS1) 10Jbond: etcd: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) [12:43:21] (03PS1) 10Jbond: netbox: update to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972371 (https://phabricator.wikimedia.org/T340741) [12:43:23] (03PS1) 10Jbond: puppetdb: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972372 (https://phabricator.wikimedia.org/T340741) [12:43:28] (03PS2) 10Jbond: analytics_cluster/ui/superset: Update ssl ca [puppet] - 10https://gerrit.wikimedia.org/r/972364 (https://phabricator.wikimedia.org/T340741) [12:43:48] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972364 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:43:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:44:01] (03PS2) 10Jbond: airflow: update SSL certs to use combined CA 
[puppet] - 10https://gerrit.wikimedia.org/r/972365 (https://phabricator.wikimedia.org/T340741) [12:44:10] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972365 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:44:19] (03PS2) 10Jbond: dragonfly: update SSL certs to use combined CA [puppet] - 10https://gerrit.wikimedia.org/r/972366 (https://phabricator.wikimedia.org/T340741) [12:44:25] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972366 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:44:32] (03PS2) 10Jbond: orchestrator: update SSL certs to use combined CA [puppet] - 10https://gerrit.wikimedia.org/r/972367 (https://phabricator.wikimedia.org/T340741) [12:44:37] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972367 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:44:47] (03PS2) 10Jbond: prometheus: update ssl CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) [12:44:52] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:45:01] (03PS2) 10Jbond: kafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 [12:45:08] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972369 (owner: 10Jbond) [12:45:18] (03PS2) 10Jbond: etcd: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) [12:45:23] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:45:32] (03PS2) 10Jbond: netbox: update to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972371 (https://phabricator.wikimedia.org/T340741) [12:45:34] 10SRE, 10Infrastructure-Foundations, 
10Patch-For-Review: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [12:45:37] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972371 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:45:47] (03PS2) 10Jbond: puppetdb: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972372 (https://phabricator.wikimedia.org/T340741) [12:45:52] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972372 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:46:11] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [12:46:51] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [12:47:05] (03CR) 10Jbond: mariadb - wmcs: update the ssl-ca value used by mariadb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:48:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/972363 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:48:20] (03CR) 10FNegri: "I finished going through all the changes, very nice job! And thanks for the small cleanups here and there." 
[puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [12:49:31] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: netbox::frontend [12:53:53] (03CR) 10CI reject: [V: 04-1] kafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 (owner: 10Jbond) [12:54:26] (03CR) 10Phuedx: [C: 03+1] wikifeeds: configure log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/972362 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [12:55:48] (03PS1) 10Jbond: netbox::frontend: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972374 (https://phabricator.wikimedia.org/T349619) [12:56:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [12:56:48] (03CR) 10Jbond: [C: 03+2] netbox::frontend: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972374 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:56:50] (03CR) 10CI reject: [V: 04-1] puppetdb: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972372 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:58:18] (03CR) 10CI reject: [V: 04-1] puppetdb: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972372 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:58:35] (03CR) 10CI reject: [V: 04-1] kafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 (owner: 10Jbond) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1300) [13:09:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:22] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: netbox::frontend [13:09:49] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: netbox::database [13:11:38] (03PS1) 10Jbond: netbox::database: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972376 (https://phabricator.wikimedia.org/T340741) [13:12:21] (03CR) 10Jbond: [C: 03+2] netbox::database: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972376 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:12:46] (03CR) 10Aklapper: [C: 03+1] "Giving a +1 as I don't have permissions to give +2" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/969515 (https://phabricator.wikimedia.org/T294754) (owner: 10Pppery) [13:12:55] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:15:47] jouncebot: nowandnext [13:15:47] For the next 0 hour(s) and 44 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1300) [13:15:47] In 0 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1400) [13:16:14] (03PS2) 10Jforrester: mathoid: update version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971400 (https://phabricator.wikimedia.org/T350004) (owner: 10Physikerwelt) [13:16:18] (03CR) 10Jforrester: [C: 03+2] mathoid: update version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971400 (https://phabricator.wikimedia.org/T350004) 
(owner: 10Physikerwelt) [13:17:08] (03Merged) 10jenkins-bot: mathoid: update version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971400 (https://phabricator.wikimedia.org/T350004) (owner: 10Physikerwelt) [13:18:16] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply [13:18:41] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [13:19:07] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [13:19:37] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [13:19:43] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply [13:20:12] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [13:21:27] (03PS2) 10Jforrester: wikifunctions: Bump evaluators to 2023-11-06-164826 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971998 (https://phabricator.wikimedia.org/T281500) [13:22:28] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Bump evaluators to 2023-11-06-164826 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971998 (https://phabricator.wikimedia.org/T281500) (owner: 10Jforrester) [13:23:13] (03PS2) 10Jforrester: wikifunctions: Switch orchestrator to 2023-11-06-172159 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509) [13:23:16] (03PS3) 10Arnaudb: debug: printing results when return object count > 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656) [13:23:26] (03Merged) 10jenkins-bot: wikifunctions: Bump evaluators to 2023-11-06-164826 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971998 (https://phabricator.wikimedia.org/T281500) (owner: 10Jforrester) [13:23:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: netbox::database [13:24:20] !log jforrester@deploy2002 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [13:24:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:12] (03CR) 10CI reject: [V: 04-1] debug: printing results when return object count > 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656) (owner: 10Arnaudb) [13:29:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:50] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox PuppetDB Import Script Failing for cloudnet2006 - https://phabricator.wikimedia.org/T350479 (10Volans) The code is not checking if the autoselection of the parent is None or not. That said re-running the script now works fine. What was changed in the Netbo... [13:30:21] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:31:39] (03PS2) 10EoghanGaffney: [apt-staging] Add dns names for apt-staging.wm.o and discovery.w [dns] - 10https://gerrit.wikimedia.org/r/971486 [13:33:35] 10SRE, 10Infrastructure-Foundations, 10netops: Do we need to generate aggregates for LVS service IP ranges? - https://phabricator.wikimedia.org/T350354 (10ayounsi) That predates me so the real reason might be lost or not valid anymore. However I see that they're redistributed in OSPF: `set policy-options po... 
[13:37:34] (HelmReleaseBadStatus) firing: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:38:32] Eurgh. [13:39:21] helmfile not returning, no output after `Upgrading release=python-evaluator, chart=wmf-stable/function-evaluator`. Should I abort? [13:40:40] (03PS3) 10Jbond: dragonfly: update SSL certs to use combined CA [puppet] - 10https://gerrit.wikimedia.org/r/972366 (https://phabricator.wikimedia.org/T340741) [13:41:02] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972366 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:41:51] (03PS1) 10Jgreen: Unify fr-tech/fr-tech-ops icinga contact groups to just fr-tech. [puppet] - 10https://gerrit.wikimedia.org/r/972378 (https://phabricator.wikimedia.org/T348559) [13:42:34] (HelmReleaseBadStatus) resolved: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:42:36] (03PS3) 10Jbond: analytics_cluster/ui/superset: Update ssl ca [puppet] - 10https://gerrit.wikimedia.org/r/972364 (https://phabricator.wikimedia.org/T340741) [13:42:48] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972364 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:42:55] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: sync [13:43:54] (03PS3) 10Jbond: orchestrator: update SSL certs to use combined CA [puppet] 
- 10https://gerrit.wikimedia.org/r/972367 (https://phabricator.wikimedia.org/T340741) [13:44:03] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972367 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:44:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:56] (03PS3) 10Jbond: kafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) [13:46:09] (03CR) 10Jforrester: [C: 04-1] wikifunctions: Switch orchestrator to 2023-11-06-172159 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509) (owner: 10Jforrester) [13:48:42] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:48:43] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:48:55] (03CR) 10Jbond: [C: 03+2] netbox: update to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972371 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:49:01] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:49:02] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:49:10] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:50:49] (03PS3) 10Jbond: puppetdb: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972372 (https://phabricator.wikimedia.org/T340741) [13:50:58] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972372 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:52:46] 
(03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:53:01] (03CR) 10JMeybohm: ipoid: add cronjobs for initialImport and dailyUpdate (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [13:54:49] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/972366 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:56:16] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/972364 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:56:59] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/972367 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:57:01] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Add dns names for apt-staging.wm.o and discovery.w [dns] - 10https://gerrit.wikimedia.org/r/971486 (owner: 10EoghanGaffney) [13:57:52] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1400). [14:00:06] James_F and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] Hey hey. [14:00:20] hello :) [14:00:28] I was going to self-deploy if that's OK. 
[14:00:35] ^ [14:00:56] i'll start MatmaRex's scripts [14:01:04] <3 [14:01:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054) (owner: 10Jforrester) [14:02:03] (03Merged) 10jenkins-bot: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054) (owner: 10Jforrester) [14:03:16] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:966570|[wikifunctions] Alter site to General Availability (T349054 T349061 T349063 T349080 T349082)]] [14:03:19] hi, thanks [14:03:33] T349082: User rights: Logged in users can edit user-contributed objects of non-restricted types - https://phabricator.wikimedia.org/T349082 [14:03:34] T349063: User rights: Logged in users can edit function input labels and key labels - https://phabricator.wikimedia.org/T349063 [14:03:34] T349054: Adjust and improve Wikifunctions user rights grants for General Availability - https://phabricator.wikimedia.org/T349054 [14:03:34] T349061: User rights: Logged in users can create functions, implementations, tests and others - https://phabricator.wikimedia.org/T349061 [14:03:35] T349080: User rights: Logged in users can edit user-contributed functions that are not-running - https://phabricator.wikimedia.org/T349080 [14:03:50] (03PS1) 10Ladsgroup: mediawiki: Add purge cronjob for pc4 [puppet] - 10https://gerrit.wikimedia.org/r/972382 (https://phabricator.wikimedia.org/T350367) [14:04:43] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:966570|[wikifunctions] Alter site to General Availability (T349054 T349061 T349063 T349080 T349082)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:04:59] !log jforrester@deploy2002 jforrester: Continuing with sync [14:05:50] MatmaRex: logs for enwiki 
https://phabricator.wikimedia.org/F41465575. i guess i should start it with `--start '["66947143"]'`? [14:05:56] (03PS37) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:06:26] (03PS9) 10Effie Mouzeli: ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [14:06:36] urbanecm: oh, yup. thanks [14:08:02] ok [14:08:37] (03CR) 10Effie Mouzeli: ipoid: add cronjobs for initialImport and dailyUpdate (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [14:09:08] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:09:15] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:09:29] MatmaRex: all started [14:09:46] !log mwmaint2002: Start multiple instances of extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php (T315510#9312431) [14:09:48] (03CR) 10Jbond: [C: 03+2] puppetdb: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972372 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:50] thank you! 
[14:09:55] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:10:17] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:966570|[wikifunctions] Alter site to General Availability (T349054 T349061 T349063 T349080 T349082)]] (duration: 07m 00s) [14:10:30] T349082: User rights: Logged in users can edit user-contributed objects of non-restricted types - https://phabricator.wikimedia.org/T349082 [14:10:31] T349063: User rights: Logged in users can edit function input labels and key labels - https://phabricator.wikimedia.org/T349063 [14:10:31] T349054: Adjust and improve Wikifunctions user rights grants for General Availability - https://phabricator.wikimedia.org/T349054 [14:10:31] T349061: User rights: Logged in users can create functions, implementations, tests and others - https://phabricator.wikimedia.org/T349061 [14:10:32] T349080: User rights: Logged in users can edit user-contributed functions that are not-running - https://phabricator.wikimedia.org/T349080 [14:11:22] 10SRE, 10Infrastructure-Foundations, 10netops: Do we need to generate aggregates for LVS service IP ranges? - https://phabricator.wikimedia.org/T350354 (10BBlack) I don't suspect it serves any real purpose at present, unless it was to avoid some filtering that exists elsewhere to avoid cross-site sharing of... [14:13:25] (03CR) 10Btullis: [C: 03+1] "Sorry for the delay, I thought I had already reviewed this." [puppet] - 10https://gerrit.wikimedia.org/r/970732 (owner: 10Muehlenhoff) [14:13:32] (03Abandoned) 10Jforrester: wikitech: Re-disable UserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504814 (owner: 10Jforrester) [14:14:21] (03CR) 10Btullis: [C: 03+1] "Looks good, many thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/972364 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:14:42] (03PS3) 10Jforrester: [DNM] Drop the 'inactive' user group everywhere, it's unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472605 [14:15:02] (03CR) 10Btullis: [C: 03+1] kafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:15:13] (03PS1) 10Stevemunene: Revert "Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde"""" [puppet] - 10https://gerrit.wikimedia.org/r/972250 [14:16:16] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/972365 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:16:19] (03CR) 10CI reject: [V: 04-1] Revert "Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde"""" [puppet] - 10https://gerrit.wikimedia.org/r/972250 (owner: 10Stevemunene) [14:16:36] (03CR) 10Marostegui: [C: 03+1] "let's go for this. I'd like to get one of the réplicas with Mariadb restarted once puppet has run, before merging the other patches that a" [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:18:11] (03PS2) 10Stevemunene: Revert "Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde"""" [puppet] - 10https://gerrit.wikimedia.org/r/972250 [14:19:45] (03CR) 10Btullis: [C: 03+1] "Looks good. It's a noop on the etcd cluster that my team manages. Might want to get ServiceOps to review regarding the main cluster." [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:20:49] (03CR) 10Btullis: [C: 03+1] "Nice, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/969128 (owner: 10Muehlenhoff) [14:22:18] (03CR) 10Btullis: [C: 03+1] dse-k8s: remove rdf-streaming-updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/966902 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [14:22:26] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/345/console" [puppet] - 10https://gerrit.wikimedia.org/r/972250 (owner: 10Stevemunene) [14:23:21] (03CR) 10Btullis: [C: 03+1] "Feel free to self-merge without a +1 on the labs/private repo, unless you're specifically requesting a review for any reason." [labs/private] - 10https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [14:27:10] 10SRE, 10Math, 10RESTBase-API, 10MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), 10Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (10Physikerwelt) 05Open→03Resolved I checked https://nl.wikipedia.org/w/index.p... [14:27:59] urbanecm, i have another namespace patch like the one from yesterday… is there time for it? 
[14:28:09] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/972385 [14:28:12] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:34] oh, if you give me a couple of minutes, i'll have two [14:31:18] 10Puppet, 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10User-jbond: PKI server don't reimage cleanly - https://phabricator.wikimedia.org/T270269 (10jbond) [14:31:45] 10SRE, 10collaboration-services: Dependencies from backports in wmf-debci - https://phabricator.wikimedia.org/T350658 (10MatthewVernon) 05Open→03Resolved Done, and the new CI variable is documented too :) [14:34:13] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [14:34:43] (03CR) 10Btullis: [V: 03+1 C: 03+2] Tidy up analytics.pp whitespace [puppet] - 10https://gerrit.wikimedia.org/r/969142 (owner: 10Btullis) [14:35:16] (03CR) 10Btullis: [C: 03+2] Configure the new mariadb servers to be replicas [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [14:38:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:23] 10SRE, 10Math, 10RESTBase-API, 10MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), 10Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (10TheDJ) Was already fixed in last weeks release it seems: Probably turned into an... 
[14:40:23] (03CR) 10Btullis: [C: 03+2] analytics_cluster/ui/superset: Update ssl ca [puppet] - 10https://gerrit.wikimedia.org/r/972364 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:40:28] (03CR) 10Hnowlan: [C: 03+2] wikifeeds: configure log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/972362 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [14:41:21] (03Merged) 10jenkins-bot: wikifeeds: configure log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/972362 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [14:41:46] urbanecm, ping [14:42:03] jouncebot: nowandnext [14:42:03] For the next 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1400) [14:42:03] In 1 hour(s) and 17 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1600) [14:43:47] we have an hour, should be doable [14:44:06] Jhs: can you put links on cal? [14:44:37] sure [14:46:04] {{done}}, but only for the main patch for now (i have to wait for it to be merged to create a cherry pick, right?) [14:49:04] Jhs: technically, you can create the cherry pick even now [14:49:25] Jhs: you said you had two patches? 
[14:49:51] urbanecm, yeah, sorry, i just added the second language to the first patch instead of creating another patch [14:49:56] (03PS1) 10Urbanecm: [Languages] Add namespaces names for dga and bbc-latn [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972251 [14:49:57] ah, makes sense [14:50:03] (03CR) 10Urbanecm: [C: 03+2] [Languages] Add namespaces names for dga and bbc-latn [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972251 (owner: 10Urbanecm) [14:50:12] (03PS1) 10Urbanecm: [Languages] Add namespaces names for dga and bbc-latn [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972252 [14:50:15] (03CR) 10Urbanecm: [C: 03+2] [Languages] Add namespaces names for dga and bbc-latn [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972252 (owner: 10Urbanecm) [14:50:22] Jhs: cherry picks created, waiting on CI [14:50:53] wmfgreat, thank you [14:51:18] * urbanecm is unsure what wmf means in this context, but okay :D [14:51:46] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:52:02] hehe, i apparently started typing the cherry pick name before i left this window XD [14:52:08] :D [14:52:20] i thought it was a joke i'm not getting :D [14:52:27] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [14:52:30] wmfno :P [14:52:37] now it is :D [14:52:43] hehehe [14:54:22] (03PS1) 10Marostegui: production-m5.sql: Add ALTER to ipoid_rw [puppet] - 10https://gerrit.wikimedia.org/r/972388 (https://phabricator.wikimedia.org/T305114) [14:54:41] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:30] urbanecm, btw, are you able/willing to add wikidata sitelink support for the four new wikis too, or is that still an Amir-only thing? 
[14:55:31] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::ui::superset [14:55:48] Jhs: depends on how often it breaks :D [14:56:23] * Jhs has no idea, (un)fortunately [14:56:26] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. [14:56:42] (03PS1) 10Muehlenhoff: Switch analytics_cluster::ui::superset to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972389 (https://phabricator.wikimedia.org/T349619) [14:57:11] (03CR) 10Btullis: [C: 03+2] airflow: update SSL certs to use combined CA [puppet] - 10https://gerrit.wikimedia.org/r/972365 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:57:29] (03CR) 10Marostegui: [C: 03+2] production-m5.sql: Add ALTER to ipoid_rw [puppet] - 10https://gerrit.wikimedia.org/r/972388 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [14:58:44] me neither, as i never ran it. [14:58:52] urbanecm, ah, amir is on it already 👍 [14:58:59] good :) [14:59:04] (thanks!) 
[14:59:38] (03CR) 10JMeybohm: [C: 03+1] "LGTM modulo that this obviously needs to be backported to the app.job module itself" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [14:59:40] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::ui::superset to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972389 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:00:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [15:01:33] 10SRE, 10Fundraising-Backlog, 10SRE Observability, 10fundraising-tech-ops, 10Patch-For-Review: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10Jgreen) a:03Jgreen [15:01:53] 10SRE, 10SRE Observability, 10fundraising-tech-ops, 10Patch-For-Review: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10Jgreen) [15:03:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [15:06:00] (03CR) 10JMeybohm: [C: 03+1] dragonfly: update SSL certs to use combined CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972366 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:08:03] (03CR) 10Ladsgroup: use virtual db domain for CentralAuth and GlobalBlocking (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [15:08:48] jouncebot: next [15:08:48] In 0 hour(s) and 51 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1600) [15:10:08] (03CR) 10Hnowlan: [C: 03+1] "Not 100% certain as to how rollout/approvals works for mediawiki-config but lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 
(https://phabricator.wikimedia.org/T348731) (owner: 10BPirkle) [15:10:12] (03Merged) 10jenkins-bot: [Languages] Add namespaces names for dga and bbc-latn [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972251 (owner: 10Urbanecm) [15:10:35] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972393 [15:11:03] marostegui: fyi i'm about to start a MW deployment. wanna go first? [15:11:09] (03CR) 10Marostegui: [C: 03+1] mediawiki: Add purge cronjob for pc4 [puppet] - 10https://gerrit.wikimedia.org/r/972382 (https://phabricator.wikimedia.org/T350367) (owner: 10Ladsgroup) [15:11:30] urbanecm: nah, you can go! [15:11:45] urbanecm: more or less how long you think it'll take you? [15:11:46] ack. i'll ping you when i finish then [15:11:49] Sweet! [15:12:41] (03Merged) 10jenkins-bot: [Languages] Add namespaces names for dga and bbc-latn [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972252 (owner: 10Urbanecm) [15:12:53] (03CR) 10Btullis: [C: 03+1] "Could we update the commit message to make it clear that this is for varnishkafka please, otherwise it makes it seem like this is going to" [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:13:03] (03CR) 10Btullis: kafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:13:28] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1005.eqiad.wmnet with OS bookworm [15:13:30] (03CR) 10Ladsgroup: [C: 03+1] ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972393 (owner: 10Marostegui) [15:13:34] Jhs: starting and let's see [15:13:43] (03CR) 10Marostegui: [C: 03+1] "Let's restart mariadb on one of the dbstore hosts to make sure everything is okay once it is merged" [puppet] - 
10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:13:51] (03PS2) 10Ladsgroup: mediawiki: Add purge cronjob for pc4 [puppet] - 10https://gerrit.wikimedia.org/r/972382 (https://phabricator.wikimedia.org/T350367) [15:13:52] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:972251|[Languages] Add namespaces names for dga and bbc-latn]], [[gerrit:972252|[Languages] Add namespaces names for dga and bbc-latn]] [15:13:57] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Add purge cronjob for pc4 [puppet] - 10https://gerrit.wikimedia.org/r/972382 (https://phabricator.wikimedia.org/T350367) (owner: 10Ladsgroup) [15:14:06] marostegui: oh, sorry, didn't see the q. hopefully under 30 mins. [15:14:10] urbanecm, 👍 [15:14:14] urbanecm: excellent [15:14:15] i'll be afk for a bit [15:14:44] (03PS1) 10Hnowlan: wmnet: add records for mw-jobrunner [dns] - 10https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) [15:14:54] (03CR) 10Marostegui: [C: 03+1] "Let's restart mariadb on one of the misc hosts to make sure it is all good" [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:15:25] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:972251|[Languages] Add namespaces names for dga and bbc-latn]], [[gerrit:972252|[Languages] Add namespaces names for dga and bbc-latn]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:15:37] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:15:44] (03CR) 10CI reject: [V: 04-1] wmnet: add records for mw-jobrunner [dns] - 10https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:15:59] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:16:11] !log urbanecm@deploy2002 urbanecm: Continuing with sync [15:16:33] PROBLEM - BFD 
status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:16:37] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:17:51] (03CR) 10Kosta Harlan: ipoid: add cronjobs for initialImport and dailyUpdate (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [15:20:27] cloudsw is expected, we're reimaging cloudservices1005 [15:20:55] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:21:11] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:21:30] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:972251|[Languages] Add namespaces names for dga and bbc-latn]], [[gerrit:972252|[Languages] Add namespaces names for dga and bbc-latn]] (duration: 07m 37s) [15:21:36] Jhs: should be live [15:21:36] (03CR) 10Ladsgroup: "I will merge this in Dec 4th. Feel free to do so if I forget." [puppet] - 10https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: 10Marostegui) [15:22:02] marostegui: i'm done. 
feel free to take over :) [15:22:08] (03PS3) 10Marostegui: parsercachepurging.pp: Increase retention back to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) [15:22:13] (03CR) 10Marostegui: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: 10Marostegui) [15:22:24] (03PS1) 10Hashar: Remap serving plugins under /r/plugins/ [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/972396 [15:22:28] (03PS8) 10Majavah: openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) [15:22:38] (03CR) 10Majavah: openstack: replace openstack_controllers variable (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [15:23:00] (03CR) 10CI reject: [V: 04-1] openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [15:23:23] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [15:24:05] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:24:06] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:24:17] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:24:18] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:25:03] (03PS1) 10Urbanecm: changeWikiConfig: Add --touch option [extensions/GrowthExperiments] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972257 (https://phabricator.wikimedia.org/T347157) [15:25:14] (03PS1) 10Urbanecm: changeWikiConfig: Add --touch option [extensions/GrowthExperiments] 
(wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972258 (https://phabricator.wikimedia.org/T347157) [15:25:26] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:25:27] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:25:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bookworm [15:26:02] (03CR) 10Marostegui: [C: 03+1] "Let me know when merged so I can restart orchestrator" [puppet] - 10https://gerrit.wikimedia.org/r/972367 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:26:08] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage [15:26:32] (03PS10) 10Effie Mouzeli: ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [15:26:35] (03PS1) 10Brouberol: Fix: make sure to enable skein certificate renewal on airflow launchers [puppet] - 10https://gerrit.wikimedia.org/r/972397 (https://phabricator.wikimedia.org/T329398) [15:28:00] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:28:01] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:28:07] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:28:08] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:28:23] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:28:23] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/346/con" [puppet] - 10https://gerrit.wikimedia.org/r/972397 
(https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:28:26] (03PS2) 10Hnowlan: api-gateway: remove puppet_ca_crt references [deployment-charts] - 10https://gerrit.wikimedia.org/r/883636 [15:28:30] (03CR) 10Effie Mouzeli: ipoid: add cronjobs for initialImport and dailyUpdate (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [15:28:49] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage [15:28:51] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:29:07] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. [15:29:18] (03CR) 10JMeybohm: [C: 03+1] "I would have made the same change...or did I?! 😄" [deployment-charts] - 10https://gerrit.wikimedia.org/r/883636 (owner: 10Hnowlan) [15:29:59] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:30:02] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:30:25] urbanecm, aye, confirmed. thank you very much! :) [15:30:26] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:30:43] great! [15:31:37] (03CR) 10Hnowlan: [C: 03+2] api-gateway: remove puppet_ca_crt references [deployment-charts] - 10https://gerrit.wikimedia.org/r/883636 (owner: 10Hnowlan) [15:31:48] (03CR) 10FNegri: [C: 03+1] "Thanks for the replies, I think this can be merged!" 
[puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [15:31:56] (03CR) 10DCausse: rdf-streaming-updater: update values for application mode (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:32:05] (03CR) 10Majavah: [C: 03+2] openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [15:32:23] (03PS2) 10Brouberol: Fix: make sure to enable skein certificate renewal on airflow launchers [puppet] - 10https://gerrit.wikimedia.org/r/972397 (https://phabricator.wikimedia.org/T329398) [15:32:33] (03Merged) 10jenkins-bot: api-gateway: remove puppet_ca_crt references [deployment-charts] - 10https://gerrit.wikimedia.org/r/883636 (owner: 10Hnowlan) [15:32:37] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972397 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:33:33] (03CR) 10Filippo Giunchedi: "+Jaime and/or Manuel, this LGTM though" [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:34:09] (03CR) 10Marostegui: [C: 03+1] "Let's restart it on a DB host to make sure it keeps working once merged." [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:34:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:34:44] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. 
[15:36:03] (03CR) 10Filippo Giunchedi: [C: 03+2] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/972378 (https://phabricator.wikimedia.org/T348559) (owner: 10Jgreen) [15:36:37] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [15:37:25] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [15:37:32] 10SRE, 10SRE Observability, 10fundraising-tech-ops, 10Patch-For-Review: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10Jgreen) p:05Triage→03Medium [15:38:13] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:38:18] (03Merged) 10jenkins-bot: ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [15:38:19] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:38:50] (03CR) 10Btullis: [C: 03+1] "Looks good. Apologies for missing the mistake in the previous patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/972397 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:39:03] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:39:19] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:41:52] (03PS1) 10Kamila Součková: [WIP] add kube-state-metrics helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 [15:42:36] !log brion halting requeueTranscodes.php media backfill job insertions for a bit while the queue catches up [15:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:42:55] (03CR) 10Brouberol: [C: 03+2] Fix: make sure to enable skein certificate renewal on airflow launchers [puppet] - 10https://gerrit.wikimedia.org/r/972397 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:43:39] (03CR) 10Jcrespo: [C: 03+1] prometheus: update ssl CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:44:41] (03CR) 10CI reject: [V: 04-1] [WIP] add kube-state-metrics helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (owner: 10Kamila Součková) [15:44:59] (03CR) 10Jcrespo: [C: 03+1] prometheus: update ssl CA to use shared CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:45:38] (03CR) 10Marostegui: [C: 03+1] prometheus: update ssl CA to use shared CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:45:42] (03CR) 10CI reject: [V: 04-1] changeWikiConfig: Add --touch option [extensions/GrowthExperiments] (wmf/1.42.0-wmf.3) - 
10https://gerrit.wikimedia.org/r/972257 (https://phabricator.wikimedia.org/T347157) (owner: 10Urbanecm) [15:46:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [15:47:50] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:48:10] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:48:31] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:48:50] (03PS3) 10JHathaway: reuse-parts.sh: remove bashisms [puppet] - 10https://gerrit.wikimedia.org/r/972061 (https://phabricator.wikimedia.org/T95064) [15:48:50] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:49:02] !log importing openjdk-8 8u392-ga-1~deb10u1 for buster-wikimedia to apt.wikimedia.org (latest Java 8 security fixes) [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:49:26] (03CR) 10CI reject: [V: 04-1] changeWikiConfig: Add --touch option [extensions/GrowthExperiments] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972258 (https://phabricator.wikimedia.org/T347157) (owner: 10Urbanecm) [15:49:28] (03CR) 10JHathaway: reuse-parts.sh: remove bashisms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972061 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [15:49:38] (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972257 (https://phabricator.wikimedia.org/T347157) (owner: 
10Urbanecm) [15:49:55] * bvibber vanishes back into the night "i'm batman" [15:50:14] hello bvibber-batman! [15:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:54:06] (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972258 (https://phabricator.wikimedia.org/T347157) (owner: 10Urbanecm) [15:54:10] (03PS1) 10JMeybohm: Update api-gateway to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) [15:54:56] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, and 2 others: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Dzahn) Thank you @Volans ! Got it, will do that :) [15:55:32] !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) [15:56:23] (03PS38) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [15:56:35] (03CR) 10Jbond: [C: 03+2] dragonfly: update SSL certs to use combined CA [puppet] - 10https://gerrit.wikimedia.org/r/972366 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:57:30] (03PS2) 10Kamila Součková: [WIP] add kube-state-metrics helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 [15:57:52] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:58:34] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:58:50] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:58:55] !log bking@deploy2002 helmfile [staging] DONE 
helmfile.d/services/rdf-streaming-updater: apply [15:59:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:05] eoghan, jelto, and arnoldokoth: #bothumor I � Unicode. All rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1600). [16:00:16] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [16:00:54] (03PS39) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [16:01:04] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1539 days) https://wikitech.wikimedia.org/wiki/Logs [16:01:39] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks. I'll aim to try it out this week." 
[puppet] - 10https://gerrit.wikimedia.org/r/972061 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [16:02:46] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:02:48] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:04:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:06:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::ui::superset [16:07:20] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:07:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Changing to -1 as the port needs to be corrected" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [16:08:25] (03CR) 10JHathaway: [C: 03+2] reuse-parts.sh: remove bashisms [puppet] - 10https://gerrit.wikimedia.org/r/972061 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [16:11:06] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:11:40] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:12:24] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:13:11] (03CR) 10Giuseppe Lavagetto: "AIUI this would have also the 
effect of having half the workers actually doing their job?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [16:14:44] (03CR) 10CI reject: [V: 04-1] changeWikiConfig: Add --touch option [extensions/GrowthExperiments] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972258 (https://phabricator.wikimedia.org/T347157) (owner: 10Urbanecm) [16:15:00] (03PS4) 10Jbond: varnishkafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) [16:16:23] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:17:03] (03PS5) 10Jbond: varnishkafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) [16:17:19] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [16:20:28] (03CR) 10Cwhite: [C: 03+1] base.statsd: add prestop sleep helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/972340 (owner: 10Giuseppe Lavagetto) [16:20:45] (03CR) 10Cwhite: [C: 03+1] mediawiki: update statsd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/972341 (owner: 10Giuseppe Lavagetto) [16:20:57] (03CR) 10Cwhite: [C: 03+1] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972342 (https://phabricator.wikimedia.org/T240685) (owner: 10Giuseppe Lavagetto) [16:21:04] (03CR) 10Cwhite: [C: 03+1] "Thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/972343 (https://phabricator.wikimedia.org/T240685) (owner: 10Giuseppe Lavagetto) [16:22:55] (03PS1) 10Jbond: puppet::agent: add ability to disable puppet timer [puppet] - 10https://gerrit.wikimedia.org/r/972410 [16:24:24] (03PS1) 10Muehlenhoff: Add Puppet aliases to easily query for hosts migrated/not yet migrated to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) [16:24:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/347/con" [puppet] - 10https://gerrit.wikimedia.org/r/972410 (owner: 10Jbond) [16:26:37] (03CR) 10CI reject: [V: 04-1] Add Puppet aliases to easily query for hosts migrated/not yet migrated to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:29:47] (03PS3) 10Kamila Součková: [WIP] add kube-state-metrics helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 [16:31:02] (03CR) 10Filippo Giunchedi: "LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/971187 (https://phabricator.wikimedia.org/T347593) (owner: 10EoghanGaffney) [16:31:24] (03PS2) 10Muehlenhoff: Add Puppet aliases for hosts running Puppet 5 and Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) [16:31:57] (03PS1) 10Jbond: sre.puppet-migrate-*: update cookbooks to stop puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/972415 [16:32:45] (03PS2) 10Jbond: sre.puppet-migrate-*: update cookbooks to stop puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/972415 [16:34:44] (03CR) 10Urbanecm: [C: 03+2] "ci failure fixed in T350338" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972258 (https://phabricator.wikimedia.org/T347157) (owner: 10Urbanecm) [16:34:46] (03PS4) 10Kamila 
Součková: [WIP] add kube-state-metrics helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (https://phabricator.wikimedia.org/T264625) [16:34:51] (03CR) 10Urbanecm: [C: 03+2] changeWikiConfig: Add --touch option [extensions/GrowthExperiments] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972257 (https://phabricator.wikimedia.org/T347157) (owner: 10Urbanecm) [16:35:03] (03CR) 10JMeybohm: "I've tried to move the config a bit closer to what we do with the default mesh config, although I'm not sure that's the best approach." [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [16:35:15] 10SRE, 10SRE Observability, 10fundraising-tech-ops: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10Jgreen) 05Open→03Resolved Done! [16:35:40] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "ci error fixed in T350338" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972258 (https://phabricator.wikimedia.org/T347157) (owner: 10Urbanecm) [16:36:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972257 (https://phabricator.wikimedia.org/T347157) (owner: 10Urbanecm) [16:36:32] (03PS1) 10Ottomata: eventgate chart - Increase default cpu limits to 1500m [deployment-charts] - 10https://gerrit.wikimedia.org/r/972418 (https://phabricator.wikimedia.org/T347477) [16:36:34] (03PS1) 10Ottomata: eventgate chart - set stream_config_retries to 3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972419 (https://phabricator.wikimedia.org/T326002) [16:36:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/972415 (owner: 10Jbond) [16:37:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/972410 (owner: 10Jbond) 
[16:38:18] (03CR) 10Ottomata: [C: 03+2] eventgate chart - Increase default cpu limits to 1500m [deployment-charts] - 10https://gerrit.wikimedia.org/r/972418 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [16:38:49] (03CR) 10Ottomata: [C: 03+2] eventgate chart - set stream_config_retries to 3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972419 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [16:39:16] (03Merged) 10jenkins-bot: eventgate chart - Increase default cpu limits to 1500m [deployment-charts] - 10https://gerrit.wikimedia.org/r/972418 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [16:39:18] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:39:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet::agent: add ability to disable puppet timer [puppet] - 10https://gerrit.wikimedia.org/r/972410 (owner: 10Jbond) [16:39:55] (03Merged) 10jenkins-bot: eventgate chart - set stream_config_retries to 3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972419 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [16:40:27] (03CR) 10Jbond: [C: 03+2] sre.puppet-migrate-*: update cookbooks to stop puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/972415 (owner: 10Jbond) [16:40:42] (03PS3) 10Muehlenhoff: Add Puppet aliases for hosts running Puppet 5 and Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) [16:40:54] (03CR) 10Muehlenhoff: Add Puppet aliases for hosts running Puppet 5 and Puppet 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:42:07] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [16:42:50] !log increasing eventgate cpu limits 1000m -> 1500m hopefully to reduce throttling, also setting 
stream_config_retries: 3 to avoid stream config refetch failures for eventgate-analytics-external. [16:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:03] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [16:44:33] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [16:44:59] (03Merged) 10jenkins-bot: sre.puppet-migrate-*: update cookbooks to stop puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/972415 (owner: 10Jbond) [16:45:54] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [16:46:28] (03CR) 10Volans: [C: 04-1] Add Puppet aliases for hosts running Puppet 5 and Puppet 7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:46:45] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [16:50:35] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] changeWikiConfig: Add --touch option [extensions/GrowthExperiments] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972257 (https://phabricator.wikimedia.org/T347157) (owner: 10Urbanecm) [16:51:02] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:972257|changeWikiConfig: Add --touch option (T347157)]], [[gerrit:972258|changeWikiConfig: Add --touch option (T347157)]] [16:51:10] T347157: Structured mentor list: Migrate `autoAssigned` into weight - https://phabricator.wikimedia.org/T347157 [16:52:24] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:972257|changeWikiConfig: Add --touch option (T347157)]], [[gerrit:972258|changeWikiConfig: Add --touch option (T347157)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:52:53] !log urbanecm@deploy2002 urbanecm: Continuing with sync [16:54:57] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Epic, 10MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), and 2 
others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Ladsgroup) 05Open→03Resolved {T350367} put a lot of these assumptions to t... [16:55:38] (03CR) 10JMeybohm: [WIP] add kube-state-metrics helmfile (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [16:57:10] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [16:58:09] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [16:58:11] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:972257|changeWikiConfig: Add --touch option (T347157)]], [[gerrit:972258|changeWikiConfig: Add --touch option (T347157)]] (duration: 07m 08s) [16:58:14] T347157: Structured mentor list: Migrate `autoAssigned` into weight - https://phabricator.wikimedia.org/T347157 [16:58:42] (03PS40) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [16:59:31] (03PS1) 10Jbond: puppetserver: don't run puppet merge on puppet7 infrastuctre [puppet] - 10https://gerrit.wikimedia.org/r/972423 [17:00:03] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [17:00:06] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1700). [17:00:06] No Gerrit patches in the queue for this window AFAICS. 
[17:01:28] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:01:40] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [17:01:41] (03PS2) 10Jbond: puppetserver: don't run puppet merge on puppet7 infrastuctre [puppet] - 10https://gerrit.wikimedia.org/r/972423 [17:01:42] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:01:59] (03PS1) 10Btullis: Use new mariadb server for analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/972424 (https://phabricator.wikimedia.org/T284150) [17:02:06] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [17:02:52] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [17:03:22] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [17:04:10] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/972423 (owner: 10Jbond) [17:04:12] (03CR) 10Jbond: etcd: update to use shared SSL CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [17:07:31] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [17:08:21] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [17:08:43] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1005.eqiad.wmnet [17:08:45] (Device rebooted) firing: Alert for device ps1-d1-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:09:06] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [17:09:24] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [17:09:46] 
(03PS2) 10Btullis: Use new mariadb server for analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/972424 (https://phabricator.wikimedia.org/T284150) [17:10:01] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [17:10:17] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [17:11:20] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:11:38] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:11:52] (03CR) 10ArielGlenn: use virtual db domain for CentralAuth and GlobalBlocking (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [17:11:52] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [17:11:54] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:12:04] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:12:10] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [17:12:30] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [17:12:46] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:12:50] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [17:13:08] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:13:12] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [17:13:19] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:13:20] (03PS1) 10JHathaway: icinga-init.sh: add shellcheck directive [puppet] - 
10https://gerrit.wikimedia.org/r/972426 (https://phabricator.wikimedia.org/T95064) [17:13:28] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [17:13:32] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:13:45] (Device rebooted) resolved: Device ps1-d1-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:14:08] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/972426 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [17:14:32] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:14:38] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 12 CORE_DIFF 7 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compil" [puppet] - 10https://gerrit.wikimedia.org/r/972424 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [17:15:34] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:18:57] (03PS1) 10Hnowlan: rest-gateway: respond on service mesh hostname:port [deployment-charts] - 10https://gerrit.wikimedia.org/r/972427 (https://phabricator.wikimedia.org/T349517) [17:19:49] (03CR) 10Dzahn: [C: 03+2] site/netboot: add special VMs for stewards [puppet] - 10https://gerrit.wikimedia.org/r/972070 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [17:20:44] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudservices1005.eqiad.wmnet [17:21:48] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1005.eqiad.wmnet with OS bookworm [17:22:39] (03PS1) 10Daimona Eaytoy: beta: Stop setting $wgCampaignEventsEnableEmail, unused 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/972428 (https://phabricator.wikimedia.org/T347067) [17:23:32] (03PS1) 10Daimona Eaytoy: prod: Stop setting $wgCampaignEventsEnableEmail, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972429 (https://phabricator.wikimedia.org/T347067) [17:24:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (10RobH) 05Open→03Resolved >>! In T350504#9312052, @cmooney wrote: > I added the config to set the port to 100G and bounced the PIC (the asw facing ports on it we... [17:26:11] (03PS1) 10Daimona Eaytoy: Remove feature flag for email [extensions/CampaignEvents] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972260 (https://phabricator.wikimedia.org/T347067) [17:26:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:27:44] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:27:53] (03PS3) 10Btullis: Use new mariadb server for analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/972424 (https://phabricator.wikimedia.org/T284150) [17:31:58] (03PS5) 10Kamila Součková: [WIP] add kube-state-metrics helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (https://phabricator.wikimedia.org/T264625) [17:36:34] (03CR) 10Kamila Součková: [WIP] add kube-state-metrics helmfile (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [17:36:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:00] PROBLEM - Router interfaces on 
cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:39:36] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:00] (03CR) 10Muehlenhoff: [C: 03+1] icinga-init.sh: add shellcheck directive (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972426 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [17:40:29] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [17:40:41] (03PS4) 10Muehlenhoff: Add Puppet aliases for hosts running Puppet 5 and Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) [17:40:42] 10SRE, 10ops-esams, 10Documentation: Update on-wiki documentation about esams - https://phabricator.wikimedia.org/T344129 (10RobH) 05Open→03Resolved >>! In T344129#9311840, @ayounsi wrote: > There are some outdated pages, for example: > * https://wikitech.wikimedia.org/wiki/Esams_data_center Deleted!... 
[17:41:06] (03CR) 10Muehlenhoff: Add Puppet aliases for hosts running Puppet 5 and Puppet 7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [17:41:12] (03CR) 10Kamila Součková: [C: 03+1] rest-gateway: respond on service mesh hostname:port [deployment-charts] - 10https://gerrit.wikimedia.org/r/972427 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [17:42:21] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6 DIFF 13 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compil" [puppet] - 10https://gerrit.wikimedia.org/r/972424 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [17:42:26] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:25] (03PS1) 10Jcrespo: Tranferrer: Enable transfers other than misc, core or x1 sections [software/transferpy] - 10https://gerrit.wikimedia.org/r/972433 (https://phabricator.wikimedia.org/T284150) [17:49:25] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: respond on service mesh hostname:port [deployment-charts] - 10https://gerrit.wikimedia.org/r/972427 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [17:50:31] (03Merged) 10jenkins-bot: rest-gateway: respond on service mesh hostname:port [deployment-charts] - 10https://gerrit.wikimedia.org/r/972427 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [17:50:33] (03CR) 10Jbond: [C: 03+2] puppetserver: don't run puppet merge on puppet7 infrastuctre [puppet] - 10https://gerrit.wikimedia.org/r/972423 (owner: 10Jbond) [17:51:11] (03PS5) 10EoghanGaffney: [gitlab] Add metrics for timing backups/restores [puppet] - 
10https://gerrit.wikimedia.org/r/971187 (https://phabricator.wikimedia.org/T347593) [17:52:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:17] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10Arnoldokoth) [17:52:34] (03CR) 10Volans: [C: 03+1] "LGTM (with a nit)" [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [17:55:35] (03PS1) 10Gergő Tisza: CentralAuth: Clear domain cookie when setting non-domain cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) [18:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1800) [18:00:21] (03CR) 10JMeybohm: [WIP] add kube-state-metrics helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [18:00:44] (03CR) 10EoghanGaffney: [gitlab] Add metrics for timing backups/restores (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971187 (https://phabricator.wikimedia.org/T347593) (owner: 10EoghanGaffney) [18:00:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:03] (03CR) 10JMeybohm: [C: 03+1] Initial commit of kube-state-metrics chart from prometheus-community [deployment-charts] - 10https://gerrit.wikimedia.org/r/970425 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [18:01:39] (03CR) 10Jcrespo: "See ticket for context- the original regex was too strict." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/972433 (https://phabricator.wikimedia.org/T284150) (owner: 10Jcrespo) [18:02:25] (03PS41) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [18:05:40] (03PS2) 10Gergő Tisza: CentralAuth: Clear domain cookie when setting non-domain cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) [18:06:01] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: mirrors [18:07:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:09] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:08:15] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:08:24] (03PS1) 10Jbond: mirrors: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972441 (https://phabricator.wikimedia.org/T349619) [18:09:19] (03PS1) 10Hnowlan: service, conftool: add mw-jobrunner config [puppet] - 10https://gerrit.wikimedia.org/r/972442 (https://phabricator.wikimedia.org/T349796) [18:09:25] (03CR) 10Jbond: [C: 03+2] mirrors: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972441 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [18:10:20] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:10:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:11:58] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate 
roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [18:12:12] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:13:34] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mirrors [18:13:56] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:16:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:35] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: netmon [18:16:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:17:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:17:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:18:36] (03PS1) 10Jbond: O:netmon: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972443 (https://phabricator.wikimedia.org/T349619) [18:18:42] No idea if this is the right spot to post that question. Still: What do I do, if I need an admin to cleanup a script in the MediaWiki namespace, but the project has no admins? 
^^' [18:18:55] (03CR) 10Jbond: [C: 03+2] O:netmon: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972443 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [18:22:34] Is there a process to get someone very limited temporary interface admin rights to do such a maintenance thing? [18:23:40] PROBLEM - Host logstash1023 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:48] (03PS42) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [18:23:57] (03CR) 10JHathaway: icinga-init.sh: add shellcheck directive (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972426 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [18:24:36] RECOVERY - Host logstash1023 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [18:25:32] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: netmon [18:26:27] (03PS6) 10Cwhite: Add StatsLib settings for Test env [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) [18:26:35] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:26:37] WMDE-Fisch: try stewards [18:26:44] urbanecm: might be able to help [18:26:45] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:26:47] (03CR) 10Cwhite: Add StatsLib settings for Test env (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [18:26:52] what's up, [18:27:05] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:27:08] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:27:09] WMDE-Fisch: what is the project you need the rights at and what kind of cleanup you need? 
:) [18:27:26] Hi urbanecm. I need to have a script on fi.wikivoyage fixed but there are no admins :-) [18:27:30] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:28:16] https://fi.wikivoyage.org/wiki/Keskustelu_j%C3%A4rjestelm%C3%A4viestist%C3%A4:Kartographer.js [18:28:46] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:29:03] (03PS43) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [18:29:06] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:29:36] hello WMDE-Fisch :). gave you int adminship for a day, hopefully that's enough :) [18:29:58] urbanecm: Great! More than enough. [18:30:03] Thanks urbanecm [18:30:04] (03CR) 10Bartosz Dziewoński: [C: 04-1] "Should we do this for all wikis, or only Meta and Commons? Is any other wiki affected?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) (owner: 10Gergő Tisza) [18:30:21] !log performing rolling memory increase on logstash collector VMs T350434 [18:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:28] T350434: Logstash collector tuning - https://phabricator.wikimedia.org/T350434 [18:30:30] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:30:35] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:30:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host stewards2001.codfw.wmnet with OS bookworm [18:30:50] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards2001.codfw.wmnet... [18:32:31] (03PS1) 10Marostegui: db1192: New candidate master for s8 [puppet] - 10https://gerrit.wikimedia.org/r/972444 (https://phabricator.wikimedia.org/T346454) [18:32:52] WMDE-Fisch: more generally, there are two things you can do in situations like this. One is to approach someone who has the rights to make script changes on any project (either a steward or a global interface admin). The second route is to become a global interface admin yourself; that way, you'd have the rights of an interface admin on all projects. What you want to do depends on how often you run into situations like [18:32:52] this. 
[18:33:17] You can always find someone to help in a situation like this in #wikimedia-stewards [18:33:22] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:34:22] WMDE-Fisch: A complete list of global interface admins is available at https://meta.wikimedia.org/wiki/Special:GlobalUsers/global-interface-editor, a place to request them to do something would be https://meta.wikimedia.org/wiki/Steward_requests/Miscellaneous and a place to request that right yourself would be https://meta.wikimedia.org/wiki/Steward_requests/Global_permissions. [18:34:30] If anything's unclear, glad to clarify :) [18:34:55] urbanecm: did you finish the deployment? [18:35:01] (03CR) 10Marostegui: [C: 03+2] db1192: New candidate master for s8 [puppet] - 10https://gerrit.wikimedia.org/r/972444 (https://phabricator.wikimedia.org/T346454) (owner: 10Marostegui) [18:35:05] marostegui: yes, and i think i pinged you too? [18:35:21] (03PS1) 10Jcrespo: test: Add a few style corrections so it works on newer versions [software/transferpy] - 10https://gerrit.wikimedia.org/r/972446 [18:35:44] maybe, I have been in meetings so possibly missed it! [18:35:46] thank you! [18:35:51] urbanecm: Thanks so much. Also for all the helpful links and information. I did what I needed to do there. :-) [18:35:54] jouncebot: next [18:35:54] In 0 hour(s) and 24 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1900) [18:35:59] right on time! 
[18:36:12] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972393 (owner: 10Marostegui) [18:36:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2012,2014].codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Upgrade [18:36:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2012,2014].codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Upgrade [18:37:06] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972393 (owner: 10Marostegui) [18:37:25] WMDE-Fisch: any time. feel free to reach to me if you need help in a case like this :) [18:37:35] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:972393|ProductionServices.php: Promote pc1014 to pc2 master]] [18:37:50] I'll do. Also thanks RhinosF1 for poking :-) [18:38:59] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:972393|ProductionServices.php: Promote pc1014 to pc2 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:39:02] !log marostegui@deploy2002 marostegui: Continuing with sync [18:39:39] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972261 [18:39:44] Also happy to poke the right people anytime [18:39:56] (03PS2) 10Jbond: prometheus-puppet-agent-stats: this timer sometime fails [puppet] - 10https://gerrit.wikimedia.org/r/971946 [18:40:11] (03CR) 10Jbond: prometheus-puppet-agent-stats: this timer sometime fails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [18:41:29] (03CR) 10Jbond: [C: 04-1] "this will fail as is" [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [18:41:49] (03CR) 10Gergő Tisza: CentralAuth: Clear domain cookie when setting non-domain cookie 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) (owner: 10Gergő Tisza) [18:44:22] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:972393|ProductionServices.php: Promote pc1014 to pc2 master]] (duration: 06m 47s) [18:46:02] (03CR) 10Cmelo: [C: 03+1] Remove feature flag for email [extensions/CampaignEvents] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972260 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [18:48:35] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T350643 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [18:50:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1192 T346454', diff saved to https://phabricator.wikimedia.org/P53153 and previous config saved to /var/cache/conftool/dbconfig/20231107-185033-root.json [18:50:37] T346454: Master and candidate master of s5 and s8 in eqiad are in the same row - https://phabricator.wikimedia.org/T346454 [18:52:40] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972261 (owner: 10Marostegui) [18:53:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:53:31] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972261 (owner: 10Marostegui) [18:54:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Change binlog format', diff saved to https://phabricator.wikimedia.org/P53154 and previous config saved to 
/var/cache/conftool/dbconfig/20231107-185400-root.json [18:54:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:54:38] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:972261|Revert "ProductionServices.php: Promote pc1014 to pc2 master"]] [18:54:41] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:55:11] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [18:55:58] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:972261|Revert "ProductionServices.php: Promote pc1014 to pc2 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:56:06] !log marostegui@deploy2002 marostegui: Continuing with sync [18:56:17] (03PS1) 10Marostegui: db1126: No longer candidate master [puppet] - 10https://gerrit.wikimedia.org/r/972450 (https://phabricator.wikimedia.org/T346454) [18:57:18] (03PS3) 10Gergő Tisza: CentralAuth: Clear domain cookie when setting non-domain cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) [18:57:32] (03CR) 10Gergő Tisza: CentralAuth: Clear domain cookie when setting non-domain cookie (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) (owner: 10Gergő Tisza) [18:57:52] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: apt_staging [18:58:31] (03PS2) 10Jforrester: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971990 (https://phabricator.wikimedia.org/T128546) (owner: 
10Jdrewniak) [18:58:34] (03CR) 10Marostegui: [C: 03+2] db1126: No longer candidate master [puppet] - 10https://gerrit.wikimedia.org/r/972450 (https://phabricator.wikimedia.org/T346454) (owner: 10Marostegui) [18:58:40] (03Abandoned) 10Jforrester: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971990 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:59:18] (03PS1) 10Jbond: apt_staging: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972451 (https://phabricator.wikimedia.org/T349619) [18:59:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stewards2001.codfw.wmnet with reason: host reimage [18:59:42] (03CR) 10Jbond: [C: 03+2] apt_staging: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972451 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [19:00:05] jnuche and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T1900). 
[19:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:19] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:972261|Revert "ProductionServices.php: Promote pc1014 to pc2 master"]] (duration: 06m 40s) [19:01:59] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [19:02:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stewards2001.codfw.wmnet with reason: host reimage [19:03:13] (03CR) 10Bartosz Dziewoński: [C: 03+1] CentralAuth: Clear domain cookie when setting non-domain cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) (owner: 10Gergő Tisza) [19:04:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: apt_staging [19:06:06] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: kerberos::kdc [19:07:01] (03CR) 10Bartosz Dziewoński: [C: 03+1] "I guess we need to keep this workaround for up to a year, since that's how long the cookies last." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) (owner: 10Gergő Tisza) [19:07:56] (03PS1) 10Jbond: kerberos::kdc: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972452 (https://phabricator.wikimedia.org/T349619) [19:08:38] (03CR) 10Jbond: [C: 03+2] kerberos::kdc: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972452 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [19:09:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Change binlog format', diff saved to https://phabricator.wikimedia.org/P53155 and previous config saved to /var/cache/conftool/dbconfig/20231107-190905-root.json [19:12:37] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:13:31] (03CR) 10Muehlenhoff: [C: 03+1] icinga-init.sh: add shellcheck directive (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972426 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [19:13:37] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kerberos::kdc [19:14:11] (03CR) 10JHathaway: [C: 03+2] icinga-init.sh: add shellcheck directive [puppet] - 10https://gerrit.wikimedia.org/r/972426 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [19:14:47] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [19:15:25] (03CR) 10Gergő Tisza: CentralAuth: Clear domain cookie when setting non-domain cookie (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) (owner: 10Gergő Tisza) [19:16:21] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:16:25] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - 
https://phabricator.wikimedia.org/T349619 (10jbond) [19:16:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stewards2001.codfw.wmnet with OS bookworm [19:16:41] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards2001.codfw.wmnet with... [19:17:21] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: mail::mx [19:18:03] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:19:15] (03PS1) 10Jbond: O:mail::mx: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972454 (https://phabricator.wikimedia.org/T349619) [19:19:47] (03CR) 10Jbond: [C: 03+2] O:mail::mx: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972454 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [19:22:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [19:22:23] (03PS44) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [19:22:29] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host stewards1001.eqiad.wmnet [19:22:30] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [19:24:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Change binlog format', diff saved to https://phabricator.wikimedia.org/P53156 and previous config saved to /var/cache/conftool/dbconfig/20231107-192410-root.json [19:25:32] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:25:42] !log bking@deploy2002 helmfile [staging] DONE 
helmfile.d/services/rdf-streaming-updater: apply [19:26:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mail::mx [19:28:52] (03PS1) 10Herron: logstash: increase heap to 4g [puppet] - 10https://gerrit.wikimedia.org/r/972456 (https://phabricator.wikimedia.org/T350434) [19:32:12] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:32:19] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:32:43] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/350/co" [puppet] - 10https://gerrit.wikimedia.org/r/972456 (https://phabricator.wikimedia.org/T350434) (owner: 10Herron) [19:33:23] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:37] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Change binlog format', diff saved to https://phabricator.wikimedia.org/P53157 and previous config saved to /var/cache/conftool/dbconfig/20231107-193915-root.json [19:45:44] 10SRE, 10Math, 10RESTBase-API, 10MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), 10Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (10SalixAlba) Its nearly clear, https://en.wikipedia.org/wiki/Barnes_G-function and... 
[19:46:53] (03PS1) 10Marostegui: pc1014: Move to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/972458 [19:50:31] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) Thanks @dzahn for making the VM! Following our IRC conversations, I'm putting a list of packages/requirements th... [19:51:25] (03CR) 10Marostegui: [C: 03+2] pc1014: Move to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/972458 (owner: 10Marostegui) [19:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:54:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Change binlog format', diff saved to https://phabricator.wikimedia.org/P53158 and previous config saved to /var/cache/conftool/dbconfig/20231107-195420-root.json [19:55:27] (03PS1) 10Jbond: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 [19:55:51] jouncebot: next [19:55:52] In 1 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T2100) [19:57:38] (03PS2) 10Jbond: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 [19:59:30] (03PS1) 10Hashar: Make serve:plugins emit a 404 for missing files [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/972460 [20:04:30] (03CR) 10CI reject: [V: 04-1] puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [20:06:00] (03PS1) 10Eevans: aqs: add .../aqs/deploy/src/ to Environment [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) [20:06:26] (03CR) 10CI reject: [V: 04-1] aqs: 
add .../aqs/deploy/src/ to Environment [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) (owner: 10Eevans) [20:06:41] (03PS2) 10Eevans: aqs: add .../aqs/deploy/src/ to Environment [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) [20:07:07] (03CR) 10CI reject: [V: 04-1] aqs: add .../aqs/deploy/src/ to Environment [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) (owner: 10Eevans) [20:07:43] (03PS3) 10Eevans: aqs: add .../aqs/deploy/src/ to Environment [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) [20:07:47] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:08:47] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) (owner: 10Eevans) [20:11:27] (03PS4) 10Eevans: aqs: add .../aqs/deploy/src/ to Environment [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) [20:14:20] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) (owner: 10Eevans) [20:16:25] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM stewards1001.eqiad.wmnet - dzahn@cumin1001" [20:18:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM stewards1001.eqiad.wmnet - dzahn@cumin1001" [20:18:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:18:11] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache stewards1001.eqiad.wmnet on all 
recursors [20:18:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) stewards1001.eqiad.wmnet on all recursors [20:18:40] (03PS2) 10Hashar: Remap serving plugins under /r/plugins/ [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/972396 [20:18:40] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM stewards1001.eqiad.wmnet - dzahn@cumin1001" [20:19:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM stewards1001.eqiad.wmnet - dzahn@cumin1001" [20:20:15] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host stewards1001.eqiad.wmnet with OS bookworm [20:20:28] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet... [20:21:20] (03PS1) 10DLynch: Enable edit check on fonwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972462 (https://phabricator.wikimedia.org/T350634) [20:24:01] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:24:03] (03CR) 10JHathaway: [C: 03+1] "looks good, pcc run?" [puppet] - 10https://gerrit.wikimedia.org/r/971461 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:25:26] (03CR) 10JHathaway: realm.pp: drop namservers global (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:26:09] (03CR) 10JHathaway: [C: 03+1] "looks good, pcc run?" 
[puppet] - 10https://gerrit.wikimedia.org/r/971463 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:26:30] (03CR) 10JHathaway: [C: 03+1] realm.pp: remove old comments [puppet] - 10https://gerrit.wikimedia.org/r/971464 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:29:35] (03CR) 10JHathaway: realm.pp: drop wikimail_smarthost global (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:30:40] (03CR) 10JHathaway: realm: drop mail_smarthost global (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:31:02] (03PS1) 10Ottomata: mw-page-content-change-enrich - bump to v1.29.0 to pick up retry logic change [deployment-charts] - 10https://gerrit.wikimedia.org/r/972463 (https://phabricator.wikimedia.org/T347884) [20:31:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage [20:31:51] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:33:33] 10SRE, 10Math, 10RESTBase-API, 10MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), 10Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (10Physikerwelt) >>! In T343648#9314173, @SalixAlba wrote: > Its nearly clear, http... 
[20:34:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage [20:37:44] 10SRE-tools, 10Data-Persistence, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Switch conftool to use the version 3 etcd datastore - https://phabricator.wikimedia.org/T350565 (10KOfori) [20:42:37] (03CR) 10DCausse: rdf-streaming-updater: update values for application mode (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [20:43:56] (03CR) 10DCausse: rdf-streaming-updater: update values for application mode (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [20:47:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stewards1001.eqiad.wmnet with OS bookworm [20:47:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host stewards1001.eqiad.wmnet [20:47:32] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with... [20:56:53] (03CR) 10JHathaway: alertmanager: add alerts-triage on /triage (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/972335 (https://phabricator.wikimedia.org/T350014) (owner: 10Filippo Giunchedi) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T2100). [21:00:05] tgr and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:10] o/ [21:01:24] Yo [21:01:34] I can deploy, will need lots of manual testing [21:01:40] and I have one more patch coming up [21:02:16] Mine should be super-quick, if you want to get it out of the way. [21:02:27] was just about to say that [21:03:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972462 (https://phabricator.wikimedia.org/T350634) (owner: 10DLynch) [21:03:35] jouncebot: nowandnext [21:03:35] For the next 0 hour(s) and 56 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T2100) [21:03:35] In 9 hour(s) and 56 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T0700) [21:04:20] i'm around if you need help testing [21:04:40] thx [21:04:40] (03Merged) 10jenkins-bot: Enable edit check on fonwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972462 (https://phabricator.wikimedia.org/T350634) (owner: 10DLynch) [21:05:07] !log tgr@deploy2002 Started scap: Backport for [[gerrit:972462|Enable edit check on fonwiki (T350634)]] [21:05:25] T350634: [Config] Enable Edit Check (References) at fon.wiki - https://phabricator.wikimedia.org/T350634 [21:06:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:28] !log tgr@deploy2002 tgr and kemayo: Backport for [[gerrit:972462|Enable edit check on fonwiki (T350634)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:03] (03PS1) 10Jcrespo: Transferer: Add a few fixes after lintering to clean up the code [software/transferpy] - 10https://gerrit.wikimedia.org/r/972471 [21:07:04] PROBLEM - 
mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:07:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:07:55] Kemayo: do you want to test it? [21:08:30] Sure, just give me second. 2002? [21:08:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.227 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:08:53] (03PS1) 10Gergő Tisza: Fix centralauthtoken key schema migration [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972262 (https://phabricator.wikimedia.org/T347223) [21:08:55] (03PS2) 10Jcrespo: Transferer: Add a few fixes after lintering to clean up the code [software/transferpy] - 10https://gerrit.wikimedia.org/r/972471 [21:09:06] tgr: It's working fine, sync away. [21:09:28] !log tgr@deploy2002 tgr and kemayo: Continuing with sync [21:09:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:14:53] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:972462|Enable edit check on fonwiki (T350634)]] (duration: 09m 45s) [21:14:57] T350634: [Config] Enable Edit Check (References) at fon.wiki - https://phabricator.wikimedia.org/T350634 [21:16:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) (owner: 10Gergő Tisza) [21:16:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:44] (03Merged) 10jenkins-bot: CentralAuth: Clear domain cookie when setting non-domain cookie 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/972435 (https://phabricator.wikimedia.org/T350695) (owner: 10Gergő Tisza) [21:17:09] !log tgr@deploy2002 Started scap: Backport for [[gerrit:972435|CentralAuth: Clear domain cookie when setting non-domain cookie (T350695)]] [21:17:22] T350695: "sessionfailure" errors on Meta and Commons - https://phabricator.wikimedia.org/T350695 [21:18:29] !log tgr@deploy2002 tgr: Backport for [[gerrit:972435|CentralAuth: Clear domain cookie when setting non-domain cookie (T350695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:17] !log tgr@deploy2002 tgr: Continuing with sync [21:35:17] !log changing email for User:Rlayton-WMF [21:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:51] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T350703 [21:36:55] T350703: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 [21:37:36] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:972435|CentralAuth: Clear domain cookie when setting non-domain cookie (T350695)]] (duration: 20m 27s) [21:37:40] T350695: "sessionfailure" errors on Meta and Commons - https://phabricator.wikimedia.org/T350695 [21:39:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972262 (https://phabricator.wikimedia.org/T347223) (owner: 10Gergő Tisza) [21:44:41] (03Merged) 10jenkins-bot: Fix centralauthtoken key schema migration [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 
https://gerrit.wikimedia.org/r/972262 (https://phabricator.wikimedia.org/T347223) (owner: Gergő Tisza) [21:45:03] !log tgr@deploy2002 Started scap: Backport for [[gerrit:972262|Fix centralauthtoken key schema migration (T347223 T350723)]] [21:45:11] T347223: Exception: Key contains invalid characters: centralauth:central-login-complete-token:1�À§À¢%2527%2522 - https://phabricator.wikimedia.org/T347223 [21:45:12] T350723: GlobalWatchlist: cross-wiki request to mediawiki.org or test.wikipedia.org failing with `badtoken` - https://phabricator.wikimedia.org/T350723 [21:45:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:24] !log tgr@deploy2002 tgr: Backport for [[gerrit:972262|Fix centralauthtoken key schema migration (T347223 T350723)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:50:28] (PS1) Strainu: [namespaces] Use correct diacritics in Romanian [mediawiki-config] - https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) [21:52:56] (PS1) Jcrespo: RemoteExecution: Restore RemoteExecution class back into transfer.py [software/transferpy] - https://gerrit.wikimedia.org/r/972475 (https://phabricator.wikimedia.org/T330882) [21:53:06] !log tgr@deploy2002 tgr: Continuing with sync [21:53:39] (CR) CI reject: [V: -1] RemoteExecution: Restore RemoteExecution class back into transfer.py [software/transferpy] - https://gerrit.wikimedia.org/r/972475 (https://phabricator.wikimedia.org/T330882) (owner: Jcrespo) [21:54:58] jouncebot: nowandnext [21:54:58] For the next 0 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231107T2100) [21:54:58] In 9 hour(s) and 5 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T0700) [21:55:13] tgr: let 
me know once done [21:55:33] will do. Ten more minutes maybe. [21:55:48] (PS2) Jcrespo: RemoteExecution: Restore RemoteExecution class back into transfer.py [software/transferpy] - https://gerrit.wikimedia.org/r/972475 (https://phabricator.wikimedia.org/T330882) [21:57:10] global watchlist on meta confirmed to be fetching from mediawiki.org correctly now [21:58:20] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:972262|Fix centralauthtoken key schema migration (T347223 T350723)]] (duration: 13m 17s) [21:58:26] T347223: Exception: Key contains invalid characters: centralauth:central-login-complete-token:1�À§À¢%2527%2522 - https://phabricator.wikimedia.org/T347223 [21:58:26] T350723: GlobalWatchlist: cross-wiki request to mediawiki.org or test.wikipedia.org failing with `badtoken` - https://phabricator.wikimedia.org/T350723 [21:58:59] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/971623 (https://phabricator.wikimedia.org/T344605) (owner: Gergő Tisza) [22:00:54] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T350703 [22:00:58] T350703: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 [22:01:00] (PS2) Gergő Tisza: Do not try to use Thumbor on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/971623 (https://phabricator.wikimedia.org/T344605) [22:01:14] (CR) TrainBranchBot: "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/971623 (https://phabricator.wikimedia.org/T344605) (owner: Gergő Tisza) [22:02:03] (Merged) jenkins-bot: Do not try to use Thumbor on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/971623 (https://phabricator.wikimedia.org/T344605) (owner: 
Gergő Tisza) [22:02:59] Amir1: all yours [22:03:04] thanks [22:04:20] (PS2) Strainu: [namespaces] Use correct diacritics in Romanian [mediawiki-config] - https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) [22:07:31] (PS45) Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [22:07:58] (CR) Ladsgroup: [C: +2] Replace WikimediaUI Base with Codex design tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/971604 (https://phabricator.wikimedia.org/T331403) (owner: VolkerE) [22:08:40] (Merged) jenkins-bot: Replace WikimediaUI Base with Codex design tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/971604 (https://phabricator.wikimedia.org/T331403) (owner: VolkerE) [22:09:25] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:971604|Replace WikimediaUI Base with Codex design tokens (T331403 T334934)]] [22:09:32] T334934: [EPIC] Replace WikimediaUI Base variables with Codex design tokens (mediawiki.skin.variables) - https://phabricator.wikimedia.org/T334934 [22:09:32] T331403: Replace legacy value tokens in WikimediaUI Base, OOUI and downstream - https://phabricator.wikimedia.org/T331403 [22:10:52] !log ladsgroup@deploy2002 ladsgroup and volker-e: Backport for [[gerrit:971604|Replace WikimediaUI Base with Codex design tokens (T331403 T334934)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:13:19] !log ladsgroup@deploy2002 ladsgroup and volker-e: Continuing with sync [22:15:34] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/972456 (https://phabricator.wikimedia.org/T350434) (owner: Herron) [22:18:41] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:971604|Replace WikimediaUI Base with Codex design tokens (T331403 T334934)]] (duration: 09m 15s) [22:18:46] T334934: [EPIC] Replace WikimediaUI Base 
variables with Codex design tokens (mediawiki.skin.variables) - https://phabricator.wikimedia.org/T334934 [22:18:46] T331403: Replace legacy value tokens in WikimediaUI Base, OOUI and downstream - https://phabricator.wikimedia.org/T331403 [22:19:04] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [22:19:06] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [22:19:26] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [22:19:32] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [22:26:44] (PS1) Gergő Tisza: Revert "Do not try to use Thumbor on beta" [mediawiki-config] - https://gerrit.wikimedia.org/r/972263 (https://phabricator.wikimedia.org/T344605) [22:28:40] (PS1) Andrew Bogott: wmcs-backup: second attempt to add cleanup timers [puppet] - https://gerrit.wikimedia.org/r/972481 [22:32:09] (CR) Andrew Bogott: [C: +2] wmcs-backup: second attempt to add cleanup timers [puppet] - https://gerrit.wikimedia.org/r/972481 (owner: Andrew Bogott) [22:36:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:45] (PS1) Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) [22:38:54] (PS2) Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) [22:45:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:58] (PS1) Dzahn: stewards: create initial role and profile [puppet] 
- https://gerrit.wikimedia.org/r/972485 (https://phabricator.wikimedia.org/T344164) [22:49:12] (CR) CI reject: [V: -1] stewards: create initial role and profile [puppet] - https://gerrit.wikimedia.org/r/972485 (https://phabricator.wikimedia.org/T344164) (owner: Dzahn) [22:49:14] (PS3) Aklapper: [namespaces] Use correct diacritics in Romanian [mediawiki-config] - https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) (owner: Strainu) [22:49:50] (PS2) Dzahn: stewards: create initial role and profile [puppet] - https://gerrit.wikimedia.org/r/972485 (https://phabricator.wikimedia.org/T344164) [22:52:03] (CR) CI reject: [V: -1] stewards: create initial role and profile [puppet] - https://gerrit.wikimedia.org/r/972485 (https://phabricator.wikimedia.org/T344164) (owner: Dzahn) [22:53:17] (PS3) Dzahn: stewards: create initial role and profile [puppet] - https://gerrit.wikimedia.org/r/972485 (https://phabricator.wikimedia.org/T344164) [22:56:16] (CR) Dzahn: [C: +2] stewards: create initial role and profile [puppet] - https://gerrit.wikimedia.org/r/972485 (https://phabricator.wikimedia.org/T344164) (owner: Dzahn) [22:58:12] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:58:27] (PS1) VolkerE: styles: Fix stylesheet validation issues [mediawiki-config] - https://gerrit.wikimedia.org/r/972488 [22:59:12] (PS1) Dzahn: stewards: fix package name for python3-requests-oauthlib [puppet] - https://gerrit.wikimedia.org/r/972490 (https://phabricator.wikimedia.org/T344164) [23:00:41] (CR) Ladsgroup: [C: +2] styles: Fix stylesheet validation issues [mediawiki-config] - https://gerrit.wikimedia.org/r/972488 (owner: VolkerE) [23:01:37] (CR) TrainBranchBot: 
[C: +2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/972488 (owner: VolkerE) [23:01:47] (Merged) jenkins-bot: styles: Fix stylesheet validation issues [mediawiki-config] - https://gerrit.wikimedia.org/r/972488 (owner: VolkerE) [23:02:10] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:972488|styles: Fix stylesheet validation issues]] [23:03:11] (CR) Dzahn: [C: +2] stewards: fix package name for python3-requests-oauthlib [puppet] - https://gerrit.wikimedia.org/r/972490 (https://phabricator.wikimedia.org/T344164) (owner: Dzahn) [23:03:28] !log ladsgroup@deploy2002 ladsgroup and volker-e: Backport for [[gerrit:972488|styles: Fix stylesheet validation issues]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:04:14] !log ladsgroup@deploy2002 ladsgroup and volker-e: Continuing with sync [23:09:24] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:972488|styles: Fix stylesheet validation issues]] (duration: 07m 14s) [23:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure