[00:09:46] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:15:30] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:34:36] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[00:38:48] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/973407
[00:38:54] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/973407 (owner: TrainBranchBot)
[00:55:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/973407 (owner: TrainBranchBot)
[01:30:15] (PuppetZeroResources) firing: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:37:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[01:47:45] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[01:52:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[02:24:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[02:38:53] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[02:59:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[03:04:01]
(BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[03:04:26] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:03:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:08:02] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:08:31] (PS1) DDesouza: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/973447 (https://phabricator.wikimedia.org/T219903)
[04:12:18] RECOVERY - CirrusSearch more_like codfw 95th percentile
latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:18:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:28:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:14:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:30:15] (PuppetZeroResources) firing: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:34:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate -
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:40:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:44:15] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:47:45] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[06:15:30] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:26:45] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly -
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:36:45] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:08:53] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:25:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:33:46] (CR) Muehlenhoff: [C: +2] Set an-master1003/1004 to use to Puppet 7 via Hiera host entries [puppet] - https://gerrit.wikimedia.org/r/973317 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:34:25] (CR) Muehlenhoff: [C: +2] Switch currently unused insetup roles to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973288 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:35:53] (PS1) Bartosz Dziewoński: ParserOutputAccess: Limit local cache size [core] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/973339 (https://phabricator.wikimedia.org/T315510)
[07:39:08] PROBLEM - Check systemd state on ganeti-test2002 is CRITICAL: CRITICAL - degraded: The following units failed:
export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:42:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: search::loader
[07:44:52] (PS1) Muehlenhoff: Switch search::loader to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973688 (https://phabricator.wikimedia.org/T349619)
[07:45:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:48:10] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:49:31] (CR) Muehlenhoff: [C: +2] Switch search::loader to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973688 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:50:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 -
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:53:58] (CR) Brouberol: "Thanks, I had no idea we could do this with types!" [puppet] - https://gerrit.wikimedia.org/r/973308 (owner: Brouberol)
[07:54:21] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/973308 (owner: Brouberol)
[07:54:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: search::loader
[07:56:37] (PS20) Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - https://gerrit.wikimedia.org/r/973308
[08:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:02:20] (PS1) Muehlenhoff: Apply settings for search-loader hosts via Hiera host entries, not per role [puppet] - https://gerrit.wikimedia.org/r/973715 (https://phabricator.wikimedia.org/T349619)
[08:06:45] (PS3) Slyngshede: P:idp:services add Catalyst OIDC service [puppet] - https://gerrit.wikimedia.org/r/973287 (https://phabricator.wikimedia.org/T350725)
[08:07:34] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:08:59] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/973715 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:11:56] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:15:12] (CR) Muehlenhoff: [C: +2] Apply settings for search-loader hosts via Hiera host entries, not per role [puppet] -
https://gerrit.wikimedia.org/r/973715 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:17:45] (CR) Jelto: [C: +2] "looks good to me." [deployment-charts] - https://gerrit.wikimedia.org/r/973447 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[08:18:53] (Merged) jenkins-bot: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/973447 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[08:19:46] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:20:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host arclamp2001.codfw.wmnet
[08:22:10] (PS1) Brouberol: Re-generate the skein certificates during business days [puppet] - https://gerrit.wikimedia.org/r/973716 (https://phabricator.wikimedia.org/T350945)
[08:22:39] (PS1) Muehlenhoff: Switch arclamp2001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973717 (https://phabricator.wikimedia.org/T349619)
[08:24:36] (CR) Muehlenhoff: [C: +2] Switch arclamp2001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973717 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:26:26] SRE, Data-Platform-SRE: Harden the netboot configuration against typos - https://phabricator.wikimedia.org/T351059 (brouberol)
[08:29:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host arclamp2001.codfw.wmnet
[08:30:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host graphite2004.codfw.wmnet
[08:32:00] (PS1) Muehlenhoff: Switch graphite2004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973718 (https://phabricator.wikimedia.org/T349619)
[08:32:12] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:10] (CR) Volans: "Thanks for having a go at this!" [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[08:33:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:56] (CR) Giuseppe Lavagetto: [V: +2 C: +2] kube-state-metrics: add build-depends [docker-images/production-images] - https://gerrit.wikimedia.org/r/973078 (https://phabricator.wikimedia.org/T350366) (owner: Giuseppe Lavagetto)
[08:34:22] !log hashar@deploy2002 Started deploy [integration/docroot@bc8aaba]: Add more libraries to doc.wikimedia.org homepage - T327604
[08:34:27] T327604: NormalizedException not shown on doc.wikimedia.org - https://phabricator.wikimedia.org/T327604
[08:34:29] !log hashar@deploy2002 Finished deploy [integration/docroot@bc8aaba]: Add more libraries to doc.wikimedia.org homepage - T327604 (duration: 00m 06s)
[08:35:22] (CR) Muehlenhoff: [C: +2] Switch graphite2004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973718 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:35:28] RECOVERY - Check systemd state on ganeti-test2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:11] (CR) Volans: [C: +2] spicerack: log cookbook execution stats [software/spicerack] - https://gerrit.wikimedia.org/r/973309 (owner: Volans)
[08:39:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host graphite2004.codfw.wmnet
[08:40:51] (CR) Volans: [C: +1] "I'll leave it to you for the inherent logic but seems sane to me" [software/netbox-extras] -
https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: Cathal Mooney)
[08:41:44] (CR) Giuseppe Lavagetto: [C: +1] Add golang instructions to README (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/973280 (owner: RLazarus)
[08:43:26] (Merged) jenkins-bot: spicerack: log cookbook execution stats [software/spicerack] - https://gerrit.wikimedia.org/r/973309 (owner: Volans)
[08:43:30] (CR) Volans: [C: +1] "LGTM (but let's test it, lmk if I can help with that)" [cookbooks] - https://gerrit.wikimedia.org/r/973315 (owner: Jbond)
[08:44:46] (CR) Giuseppe Lavagetto: [C: +1] service, conftool: add mw-jobrunner config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/972442 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[08:45:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host webperf2003.codfw.wmnet
[08:45:08] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:44] (CR) Giuseppe Lavagetto: [C: +1] wmnet: add records for mw-jobrunner [dns] - https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[08:45:58] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:46:12] (CR) Giuseppe Lavagetto: [C: +1] Add StatsLib settings for Test env [mediawiki-config] - https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: Cwhite)
[08:47:26] (CR) Giuseppe Lavagetto: [C: +1] changeprop: set num_workers to zero [deployment-charts] - https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[08:48:30] (CR) Volans: [C: +1] "LGTM" [cookbooks] -
https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney)
[08:48:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:26] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[08:49:39] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[08:49:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T348183)', diff saved to https://phabricator.wikimedia.org/P53300 and previous config saved to /var/cache/conftool/dbconfig/20231113-084945-arnaudb.json
[08:49:49] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[08:50:05] (PS1) Muehlenhoff: Switch webperf2003 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973720 (https://phabricator.wikimedia.org/T349619)
[08:52:03] (CR) Giuseppe Lavagetto: [C: -1] "I really don't like the approach of making a chart template a yaml-passthrough, where the user needs to know inner details of how the app" [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[08:52:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T348183)', diff saved to https://phabricator.wikimedia.org/P53301 and previous config saved to /var/cache/conftool/dbconfig/20231113-085243-arnaudb.json
[08:52:46] (CR) Muehlenhoff: [C: +2] Switch webperf2003 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973720 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:52:48] RECOVERY - SSH
on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:53:18] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:55:42] (PS1) JMeybohm: envoy: Allow additional arguments to envoy [docker-images/production-images] - https://gerrit.wikimedia.org/r/973721 (https://phabricator.wikimedia.org/T300033)
[08:55:46] !log bounce prometheus eqiad for k8s / k8s-aux - T343529
[08:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:51] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529
[08:57:12] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:57:30] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:57:55] morning, I'll be backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OAuth/+/973247/ in the next few minutes
[08:57:56] that's me ^
[08:58:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host webperf2003.codfw.wmnet
[08:58:48] (CR) Filippo Giunchedi: [C: +2] titan: add public_domain to tlsproxy [puppet] - https://gerrit.wikimedia.org/r/973158 (owner: Filippo Giunchedi)
[09:00:29] SRE, Data-Platform-SRE, Patch-For-Review: Harden the netboot configuration against typos -
https://phabricator.wikimedia.org/T351059 (Peachey88)
[09:00:54] (CR) TrainBranchBot: [C: +2] "Approved by jnuche@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/973247 (https://phabricator.wikimedia.org/T350836) (owner: BryanDavis)
[09:01:20] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:02:30] (KubernetesAPINotScrapable) resolved: (2) k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[09:05:40] (Merged) jenkins-bot: Fix BlockDisablesLogin recursion [extensions/OAuth] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/973247 (https://phabricator.wikimedia.org/T350836) (owner: BryanDavis)
[09:06:42] !log jnuche@deploy2002 Started scap: Backport for [[gerrit:973247|Fix BlockDisablesLogin recursion (T350836 T350080)]]
[09:06:48] T350836: OAuth login to wikitech fails when running MediaWiki 1.42.0-wmf.4 - https://phabricator.wikimedia.org/T350836
[09:06:48] T350080: 1.42.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T350080
[09:06:49] (CR) Giuseppe Lavagetto: [C: +2] tox.ini: remove skipsdist [software/conftool] - https://gerrit.wikimedia.org/r/960068 (https://phabricator.wikimedia.org/T346238) (owner: Hashar)
[09:07:08] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:07:40] (PS1) Filippo Giunchedi: hieradata: fixup titan cfssl/envoy config [puppet] - https://gerrit.wikimedia.org/r/973722
[09:07:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P53302 and previous config saved to /var/cache/conftool/dbconfig/20231113-090750-arnaudb.json
[09:08:34] !log jnuche@deploy2002 bd808
and jnuche: Backport for [[gerrit:973247|Fix BlockDisablesLogin recursion (T350836 T350080)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:08:48] !log jnuche@deploy2002 bd808 and jnuche: Continuing with sync
[09:08:57] (CR) Filippo Giunchedi: [C: +2] hieradata: fixup titan cfssl/envoy config [puppet] - https://gerrit.wikimedia.org/r/973722 (owner: Filippo Giunchedi)
[09:10:25] (Merged) jenkins-bot: tox.ini: remove skipsdist [software/conftool] - https://gerrit.wikimedia.org/r/960068 (https://phabricator.wikimedia.org/T346238) (owner: Hashar)
[09:10:30] (PS5) JMeybohm: Update api-gateway for cert-manager support [deployment-charts] - https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033)
[09:10:33] (PS4) JMeybohm: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033)
[09:11:12] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:13:08] (CR) Muehlenhoff: "Two nits, looks good otherwise."
[puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[09:13:20] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:14:32] !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:973247|Fix BlockDisablesLogin recursion (T350836 T350080)]] (duration: 07m 49s)
[09:14:38] T350836: OAuth login to wikitech fails when running MediaWiki 1.42.0-wmf.4 - https://phabricator.wikimedia.org/T350836
[09:14:39] T350080: 1.42.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T350080
[09:16:01] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:16:52] !log hashar@deploy2002 Started deploy [integration/docroot@9bf1967]: Replace WikimediaUI Base with Codex design tokens T331403 T334934
[09:16:57] T331403: Replace legacy value tokens in WikimediaUI Base, OOUI and downstream - https://phabricator.wikimedia.org/T331403
[09:16:57] T334934: [EPIC] Replace WikimediaUI Base variables with Codex design tokens (mediawiki.skin.variables) - https://phabricator.wikimedia.org/T334934
[09:16:59] !log hashar@deploy2002 Finished deploy [integration/docroot@9bf1967]: Replace WikimediaUI Base with Codex design tokens T331403 T334934 (duration: 00m 07s)
[09:18:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:58] (CR) Ayounsi: [C: +1] Fail when setting int relations if PuppetDB parent not found in Netbox (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: Cathal
Mooney)
[09:22:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P53303 and previous config saved to /var/cache/conftool/dbconfig/20231113-092256-arnaudb.json
[09:24:12] (CR) Nik Gkountas: [C: +1] testwiki: Enable the Unified Content Translation Dashboard [mediawiki-config] - https://gerrit.wikimedia.org/r/973170 (https://phabricator.wikimedia.org/T337915) (owner: KartikMistry)
[09:30:16] (PuppetZeroResources) firing: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:30:19] (CR) Elukey: changeprop: allow to define Kafka settings for Job Queues (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[09:31:19] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: labswiki to 1.42.0-wmf.4 (T350836 T350080)
[09:31:25] T350836: OAuth login to wikitech fails when running MediaWiki 1.42.0-wmf.4 - https://phabricator.wikimedia.org/T350836
[09:31:26] T350080: 1.42.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T350080
[09:31:34] !log installing dbus security updates on bullseye
[09:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:41] (CR) Filippo Giunchedi: [C: +2] hieradata: update service catalog roles [puppet] - https://gerrit.wikimedia.org/r/973159 (owner: Filippo Giunchedi)
[09:35:48] (PS2) Filippo Giunchedi: hieradata: update service catalog roles [puppet] - https://gerrit.wikimedia.org/r/973159
[09:36:38] (PS1) Jaime Nuche: labswiki to 1.42.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/973724 (https://phabricator.wikimedia.org/T350836)
[09:36:41] (CR) Jaime Nuche: [C: +2] labswiki to
1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973724 (https://phabricator.wikimedia.org/T350836) (owner: 10Jaime Nuche) [09:36:49] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: arclamp [09:37:54] (03Merged) 10jenkins-bot: labswiki to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973724 (https://phabricator.wikimedia.org/T350836) (owner: 10Jaime Nuche) [09:38:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T348183)', diff saved to https://phabricator.wikimedia.org/P53304 and previous config saved to /var/cache/conftool/dbconfig/20231113-093802-arnaudb.json [09:38:05] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:38:07] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [09:38:11] (03PS1) 10Muehlenhoff: Switch arclamp to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973725 (https://phabricator.wikimedia.org/T349619) [09:38:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:38:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53305 and previous config saved to /var/cache/conftool/dbconfig/20231113-093824-arnaudb.json [09:39:49] (03CR) 10Muehlenhoff: [C: 03+2] Switch arclamp to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973725 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:42:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53306 and previous config saved to /var/cache/conftool/dbconfig/20231113-094218-arnaudb.json [09:43:55] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.puppet.migrate-role (exit_code=0) for role: arclamp [09:44:16] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: graphite::production [09:46:14] (03PS1) 10Muehlenhoff: Switch graphite::production to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973726 (https://phabricator.wikimedia.org/T349619) [09:47:11] (03PS6) 10Slyngshede: Ensure that build directories are cleaned up [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) [09:49:49] (03CR) 10Gmodena: [C: 03+2] data-engineering: eventgate: standardize alerts [alerts] - 10https://gerrit.wikimedia.org/r/959039 (https://phabricator.wikimedia.org/T326002) (owner: 10Gmodena) [09:50:22] (03CR) 10Muehlenhoff: [C: 03+2] Switch graphite::production to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973726 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:51:36] (03Merged) 10jenkins-bot: data-engineering: eventgate: standardize alerts [alerts] - 10https://gerrit.wikimedia.org/r/959039 (https://phabricator.wikimedia.org/T326002) (owner: 10Gmodena) [09:55:00] (03PS3) 10Volans: external clouds: allow to get prefixes from RIPE [puppet] - 10https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) [09:55:59] (03PS1) 10Btullis: Depool the cloudb10[13-16] hosts for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973728 (https://phabricator.wikimedia.org/T340741) [09:56:22] PROBLEM - Check systemd state on ganeti1027 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:45] (03PS2) 10Btullis: Depool the cloudb10[13-16] hosts for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973728 (https://phabricator.wikimedia.org/T340741) [09:57:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate 
- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:57:03] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973728 (https://phabricator.wikimedia.org/T340741) (owner: 10Btullis) [09:57:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P53307 and previous config saved to /var/cache/conftool/dbconfig/20231113-095725-arnaudb.json [09:58:05] 10SRE, 10Traffic: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) [09:58:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [09:58:51] 10SRE, 10Traffic: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) p:05Triage→03Medium [10:00:30] (03CR) 10Vgutierrez: [C: 03+2] Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [10:00:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: graphite::production [10:02:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:02:50] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will 
need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10MoritzMuehlenhoff) [10:04:20] (03CR) 10Jcrespo: [C: 03+2] sql: Migrate mediabackups metadata size from int to bigint [software/mediabackups] - 10https://gerrit.wikimedia.org/r/973364 (https://phabricator.wikimedia.org/T191804) (owner: 10Jcrespo) [10:04:34] (03CR) 10Majavah: [C: 03+1] Depool the cloudb10[13-16] hosts for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973728 (https://phabricator.wikimedia.org/T340741) (owner: 10Btullis) [10:07:09] (03CR) 10Jbond: [C: 03+1] "lgtm excluding the nits from moritz" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [10:07:46] (03PS4) 10Slyngshede: P:idp:services add Catalyst OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/973287 (https://phabricator.wikimedia.org/T350725) [10:09:07] (03CR) 10Jbond: [V: 03+1 C: 03+2] etcd: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [10:13:12] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:14:00] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:15:01] (03PS4) 10Volans: external clouds: allow to get prefixes from RIPE [puppet] - 10https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) [10:19:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/973323 (owner: 10EoghanGaffney) [10:21:00] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:16] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [10:22:26] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:58] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 40 connections established with conf1007.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal [10:23:02] PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 5 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:23:02] PROBLEM - PyBal connections to etcd on lvs5006 is CRITICAL: CRITICAL: 7 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [10:23:54] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 42 connections established with conf1007.eqiad.wmnet:4001 (min=128) https://wikitech.wikimedia.org/wiki/PyBal [10:24:12] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:24:37] ^ jbond [10:24:37] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [10:25:26] maybe it is expected? 
[10:25:46] PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 4 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:26:30] PROBLEM - PyBal connections to etcd on lvs4010 is CRITICAL: CRITICAL: 4 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [10:26:42] PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 1 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [10:27:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'depool T350458', diff saved to https://phabricator.wikimedia.org/P53308 and previous config saved to /var/cache/conftool/dbconfig/20231113-102730-arnaudb.json [10:27:34] T350458: Decommission db11[26-49] - https://phabricator.wikimedia.org/T350458 [10:27:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P53309 and previous config saved to /var/cache/conftool/dbconfig/20231113-102739-arnaudb.json [10:27:51] (03CR) 10Jcrespo: "All lvs are complaining, could it be related or is it expected?" 
[puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [10:29:32] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 13 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [10:30:00] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:31:32] jynus: thanks reverting [10:31:32] (03PS1) 10Jbond: Revert "etcd: update to use shared SSL CA" [puppet] - 10https://gerrit.wikimedia.org/r/973340 [10:31:38] PROBLEM - PyBal connections to etcd on lvs2014 is CRITICAL: CRITICAL: 30 connections established with conf2004.codfw.wmnet:4001 (min=97) https://wikitech.wikimedia.org/wiki/PyBal [10:31:40] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:31:44] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "etcd: update to use shared SSL CA" [puppet] - 10https://gerrit.wikimedia.org/r/973340 (owner: 10Jbond) [10:31:55] (03CR) 10Btullis: [C: 03+2] Depool the cloudb10[13-16] hosts for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973728 (https://phabricator.wikimedia.org/T340741) (owner: 10Btullis) [10:32:12] PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [10:34:06] (03PS4) 10Fabfur: haproxy: re-set varnish maxconn for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) [10:34:15] (03PS5) 10Hnowlan: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:35:08] PROBLEM - Check systemd state 
on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:52] (03PS1) 10Jbond: etcd: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/973341 (https://phabricator.wikimedia.org/T340741) [10:36:03] (03PS6) 10Hnowlan: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:37:02] PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [10:37:09] (03CR) 10Hnowlan: [C: 03+1] "I've added one minor fix, otherwise lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:38:16] PROBLEM - PyBal connections to etcd on lvs2011 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:38:34] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 983 [10:38:38] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) p:05Triage→03High [10:40:03] (03PS1) 10Vgutierrez: Release 1.15.14 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/973732 (https://phabricator.wikimedia.org/T348837) [10:42:22] (03CR) 10Hnowlan: [C: 03+2] wmnet: add records for mw-jobrunner [dns] - 10https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:42:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53311 and previous config saved to 
/var/cache/conftool/dbconfig/20231113-104245-arnaudb.json [10:42:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:42:50] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [10:43:01] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:44:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:45:00] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:45:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:45:02] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:28] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:45:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T348183)', diff saved to https://phabricator.wikimedia.org/P53312 and previous config saved to /var/cache/conftool/dbconfig/20231113-104534-arnaudb.json [10:48:38] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:56] (03PS1) 10Majavah: scap: remove references to cloudmetrics1003/4 [puppet] - 10https://gerrit.wikimedia.org/r/973735 (https://phabricator.wikimedia.org/T326266) [10:49:18] !log arnaudb@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1161 (T348183)', diff saved to https://phabricator.wikimedia.org/P53313 and previous config saved to /var/cache/conftool/dbconfig/20231113-104917-arnaudb.json [10:49:22] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [10:49:49] !log taavi@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudmetrics[1003-1004].eqiad.wmnet [10:49:56] (03Abandoned) 10Hashar: Basic retry mechanism for specific kafka errors [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: 10Fabfur) [10:49:58] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [10:50:01] (03Abandoned) 10Hashar: Add version print option [software/purged] - 10https://gerrit.wikimedia.org/r/962670 (https://phabricator.wikimedia.org/T347839) (owner: 10Fabfur) [10:50:20] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal, AS64600/IPv4: Active - PyBal, AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:50:55] (03PS1) 10Slyngshede: C:prometheus::ethtool_exporter suppress logging [puppet] - 10https://gerrit.wikimedia.org/r/973736 (https://phabricator.wikimedia.org/T351068) [10:50:58] !log roll restart pybal after failed etcd cr [10:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:08] RECOVERY - PyBal connections to etcd on lvs4010 is OK: OK: 16 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [10:52:16] RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 97 connections established with conf2004.codfw.wmnet:4001 (min=97) https://wikitech.wikimedia.org/wiki/PyBal [10:52:26] RECOVERY - PyBal connections to etcd on lvs1017 is 
OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:52:26] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 82 connections established with conf1007.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal [10:52:26] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 34 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [10:52:26] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 128 connections established with conf1007.eqiad.wmnet:4001 (min=128) https://wikitech.wikimedia.org/wiki/PyBal [10:52:28] RECOVERY - PyBal connections to etcd on lvs2011 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:52:28] RECOVERY - PyBal connections to etcd on lvs2012 is OK: OK: 6 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [10:52:28] RECOVERY - PyBal connections to etcd on lvs4009 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [10:52:28] RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:52:28] RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:52:29] RECOVERY - PyBal connections to etcd on lvs5006 is OK: OK: 16 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [10:52:29] RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [10:53:40] RECOVERY - Check systemd state on ganeti1027 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:59] (03CR) 10Slyngshede: "I'd like a prettier solution, but that will require a bit of hacking around in the exporter, so for now this is the quickest solution for " [puppet] - 10https://gerrit.wikimedia.org/r/973736 (https://phabricator.wikimedia.org/T351068) (owner: 10Slyngshede) [10:55:04] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:56:17] (03CR) 10Fabfur: [C: 03+2] haproxy: re-set varnish maxconn for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur) [10:57:29] (03PS1) 10Hashar: Archive repository [software/purged] - 10https://gerrit.wikimedia.org/r/973738 (https://phabricator.wikimedia.org/T347623) [10:57:36] (03CR) 10CI reject: [V: 04-1] Archive repository [software/purged] - 10https://gerrit.wikimedia.org/r/973738 (https://phabricator.wikimedia.org/T347623) (owner: 10Hashar) [10:57:46] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [10:57:48] (03CR) 10Hashar: [V: 03+2 C: 03+2] Archive repository [software/purged] - 10https://gerrit.wikimedia.org/r/973738 (https://phabricator.wikimedia.org/T347623) (owner: 10Hashar) [10:57:55] (03CR) 10CI reject: [V: 04-1] Archive repository [software/purged] - 10https://gerrit.wikimedia.org/r/973738 (https://phabricator.wikimedia.org/T347623) (owner: 10Hashar) [10:58:55] (03PS4) 10ArielGlenn: use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 
(https://phabricator.wikimedia.org/T348486) [11:00:05] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudmetrics[1003-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [11:00:06] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T1100) [11:00:58] (03CR) 10Jbond: [C: 03+2] etcd: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/973341 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [11:01:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:01:12] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudmetrics[1003-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [11:01:12] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:01:13] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudmetrics[1003-1004].eqiad.wmnet [11:01:22] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [11:01:28] (03PS1) 10Filippo Giunchedi: hieradata: add thanos_oidc to idp [puppet] - 10https://gerrit.wikimedia.org/r/973739 (https://phabricator.wikimedia.org/T331512) [11:01:30] (03PS1) 10Filippo Giunchedi: 
oauth2_proxy: new module [puppet] - 10https://gerrit.wikimedia.org/r/973740 (https://phabricator.wikimedia.org/T331512) [11:01:32] (03PS1) 10Filippo Giunchedi: thanos: add oidc support via oauth2-proxy [puppet] - 10https://gerrit.wikimedia.org/r/973741 (https://phabricator.wikimedia.org/T331512) [11:02:04] (03Abandoned) 10Jcrespo: Enable gitlab backup type for wmfbackups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo) [11:02:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/973736 (https://phabricator.wikimedia.org/T351068) (owner: 10Slyngshede) [11:02:16] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: webperf [11:03:06] (03CR) 10Slyngshede: [C: 03+2] C:prometheus::ethtool_exporter suppress logging [puppet] - 10https://gerrit.wikimedia.org/r/973736 (https://phabricator.wikimedia.org/T351068) (owner: 10Slyngshede) [11:03:18] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/973735 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [11:03:26] (03PS1) 10Muehlenhoff: Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/973743 [11:03:33] (03CR) 10Majavah: [C: 03+2] scap: remove references to cloudmetrics1003/4 [puppet] - 10https://gerrit.wikimedia.org/r/973735 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [11:03:56] (03CR) 10Filippo Giunchedi: "I'll add the secrets to private.git before merging" [puppet] - 10https://gerrit.wikimedia.org/r/973739 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [11:04:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P53314 and previous config saved to /var/cache/conftool/dbconfig/20231113-110424-arnaudb.json [11:04:41] (03PS1) 10Muehlenhoff: Switch webperf to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/973744 (https://phabricator.wikimedia.org/T349619) [11:04:54] PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 5 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [11:05:05] * jbond looking [11:05:09] (03PS1) 10Majavah: site: remove references to cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/973745 (https://phabricator.wikimedia.org/T351077) [11:05:17] 10Puppet, 10iPoid-Service: Rename FEED_API_KEY - https://phabricator.wikimedia.org/T350903 (10jijiki) 05Open→03Resolved a:03jijiki This was merged on Friday on the puppetmaster [11:05:28] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:05:34] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:05:38] PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:52] (03PS1) 10Hnowlan: wmnet: fix typo [dns] - 10https://gerrit.wikimedia.org/r/973746 (https://phabricator.wikimedia.org/T349796) [11:06:09] (EtcdReplicationDown) firing: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [11:06:16] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/973743 (owner: 10Muehlenhoff) [11:06:19] uh? 
[11:06:22] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [11:06:27] vgutierrez: it's back up [11:06:34] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2005 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:06:36] ack the alert [11:06:38] !alerts [11:06:44] RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:50] ah, it recovered [11:07:33] 10Puppet, 10iPoid-Service, 10serviceops: Rename FEED_API_KEY - https://phabricator.wikimedia.org/T350903 (10jijiki) [11:07:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host clouddb1013.eqiad.wmnet [11:08:04] (03CR) 10Muehlenhoff: [C: 03+2] Switch webperf to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973744 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:08:18] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/973743 (owner: 10Muehlenhoff) [11:08:48] RECOVERY - PyBal connections to etcd on lvs2012 is OK: OK: 6 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [11:08:53] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:47] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi) [11:10:56] 
PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:09] (EtcdReplicationDown) resolved: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [11:11:55] (03PS1) 10Hashar: Archive repository [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/973747 (https://phabricator.wikimedia.org/T347623) [11:12:03] (03CR) 10CI reject: [V: 04-1] Archive repository [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/973747 (https://phabricator.wikimedia.org/T347623) (owner: 10Hashar) [11:12:48] (03CR) 10Hashar: [V: 03+2 C: 03+2] Archive repository [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/973747 (https://phabricator.wikimedia.org/T347623) (owner: 10Hashar) [11:12:55] (03CR) 10CI reject: [V: 04-1] Archive repository [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/973747 (https://phabricator.wikimedia.org/T347623) (owner: 10Hashar) [11:12:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: webperf [11:13:38] we are having spikes of 5xx on all dcs: https://grafana.wikimedia.org/goto/2O_tqVSIz?orgId=1 [11:15:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:17:03] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/973350/405/install1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi) [11:18:36] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:19:04] looks like there was a spike in 
session loss around the same time https://grafana.wikimedia.org/goto/6GH8qVISk?orgId=1 [11:19:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P53315 and previous config saved to /var/cache/conftool/dbconfig/20231113-111930-arnaudb.json [11:20:38] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:20:50] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [11:20:58] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:16] spikes in sessionstore latency and session DELETEs also https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-dc=thanos&var-site=eqiad&var-service=sessionstore&var-prometheus=k8s&var-container_name=kask-production&from=1699872241385&to=1699874092547 [11:22:43] hnowlan: during that window i deployed a cr to etcd which caused etcd to restart. i then did a rolling restart of pybal with a 10 second sleep to ensure pybal reconnected correctly [11:23:03] this also caused etcdmirror to go down briefly [11:23:33] jbond: ah [11:24:46] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi) [11:28:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host prometheus4002.ulsfo.wmnet [11:30:06] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. 
[11:30:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1013.eqiad.wmnet [11:32:03] (03PS1) 10Muehlenhoff: Switch prometheus4002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973748 (https://phabricator.wikimedia.org/T349619) [11:33:02] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:33:16] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host clouddb1014.eqiad.wmnet [11:33:39] (03CR) 10Muehlenhoff: [C: 03+2] Switch prometheus4002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973748 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:34:09] (03PS2) 10Jbond: sre.hosts.reimage: reimage with current puppet version unless new [cookbooks] - 10https://gerrit.wikimedia.org/r/973315 [11:34:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T348183)', diff saved to https://phabricator.wikimedia.org/P53316 and previous config saved to /var/cache/conftool/dbconfig/20231113-113437-arnaudb.json [11:34:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [11:34:42] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:34:52] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [11:34:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T348183)', diff saved to https://phabricator.wikimedia.org/P53317 and previous config saved to /var/cache/conftool/dbconfig/20231113-113458-arnaudb.json [11:35:26] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:36:02] PROBLEM - 
haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [11:37:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1014.eqiad.wmnet [11:37:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T348183)', diff saved to https://phabricator.wikimedia.org/P53318 and previous config saved to /var/cache/conftool/dbconfig/20231113-113751-arnaudb.json [11:38:24] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:38:44] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:38:50] (03CR) 10Jbond: [V: 03+1 C: 03+2] etcd: update to use shared SSL CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [11:39:18] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:39:53] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host clouddb1015.eqiad.wmnet [11:40:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. 
[11:42:42] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [11:42:51] (03PS1) 10Jbond: networks: fix whitespace [puppet] - 10https://gerrit.wikimedia.org/r/973750 [11:43:19] (03CR) 10Ayounsi: [C: 03+1] networks: fix whitespace [puppet] - 10https://gerrit.wikimedia.org/r/973750 (owner: 10Jbond) [11:43:34] (03CR) 10Jbond: [C: 03+2] networks: fix whitespace [puppet] - 10https://gerrit.wikimedia.org/r/973750 (owner: 10Jbond) [11:43:57] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1015.eqiad.wmnet [11:43:58] PROBLEM - Host clouddb1015 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:06] RECOVERY - Host clouddb1015 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:45:03] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: remove backwards compat [puppet] - 10https://gerrit.wikimedia.org/r/957721 (owner: 10Majavah) [11:45:14] (03PS4) 10Hnowlan: service, conftool: add mw-jobrunner config [puppet] - 10https://gerrit.wikimedia.org/r/972442 (https://phabricator.wikimedia.org/T349796) [11:45:16] (03PS4) 10EoghanGaffney: [apt-staging] Fix apt_staging.yaml, add envoy config and pki [puppet] - 10https://gerrit.wikimedia.org/r/973323 [11:45:26] PROBLEM - MariaDB Replica SQL: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:45:26] PROBLEM - MariaDB Replica IO: s6 on clouddb1015 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:45:59] (03CR) 10Aqu: [C: 03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [11:46:02] PROBLEM - MariaDB read only s4 on clouddb1015 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:46:08] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s4.service,wmf-pt-kill@s6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:20] PROBLEM - mysqld processes on clouddb1015 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:47:18] RECOVERY - MariaDB read only s4 on clouddb1015 is OK: Version 10.6.14-MariaDB, Uptime 49s, read_only: True, event_scheduler: False, 317.25 QPS, connection latency: 0.004221s, query latency: 0.000538s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:47:28] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:47:34] RECOVERY - mysqld processes on clouddb1015 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:47:54] RECOVERY - MariaDB Replica IO: s6 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:47:54] RECOVERY - MariaDB Replica SQL: s4 on clouddb1015 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:48:48] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:08] PROBLEM - Check systemd state on 
ms-be1060 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:12] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/407/console" [puppet] - 10https://gerrit.wikimedia.org/r/973750 (owner: 10Jbond) [11:51:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:20] PROBLEM - Check systemd state on ms-be1061 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973750 (owner: 10Jbond) [11:52:04] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1060 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:52:24] RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:52:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P53319 and previous config saved to 
/var/cache/conftool/dbconfig/20231113-115257-arnaudb.json [11:53:54] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:53:59] (PuppetZeroResources) firing: Puppet has failed generate resources on ganeti1030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:55:14] (03CR) 10MVernon: [C: 03+2] admin: add ecarg to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/973324 (https://phabricator.wikimedia.org/T350818) (owner: 10Hnowlan) [11:55:59] (03CR) 10Aqu: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/973320 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [11:56:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:24] PROBLEM - Check systemd state on ganeti1032 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:12] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1061 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:59:13] jouncebot: nowandnext [11:59:13] For the next 0 hour(s) and 0 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T1100) [11:59:13] In 2 hour(s) and 0 minute(s): UTC afternoon backport 
window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T1400) [11:59:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host prometheus4002.ulsfo.wmnet [11:59:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968397 (https://phabricator.wikimedia.org/T331595) (owner: 10MPGuy2824) [12:00:44] (^ beta-only fwiw) [12:00:58] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Remove values for renamed PageTriage variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968397 (https://phabricator.wikimedia.org/T331595) (owner: 10MPGuy2824) [12:02:42] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to WMF for ecarg - https://phabricator.wikimedia.org/T350818 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Done :) [12:02:49] (03PS1) 10Cathal Mooney: Add netboot config for new private vlans in codfw rows A/B [puppet] - 10https://gerrit.wikimedia.org/r/973752 (https://phabricator.wikimedia.org/T327938) [12:03:08] RECOVERY - Check systemd state on ms-be1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:23] (03CR) 10CI reject: [V: 04-1] Add netboot config for new private vlans in codfw rows A/B [puppet] - 10https://gerrit.wikimedia.org/r/973752 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [12:03:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on ganeti1030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:04:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1016.eqiad.wmnet [12:05:25] (03PS2) 10Cathal Mooney: Add netboot config for new private vlans in codfw rows A/B [puppet] - 
10https://gerrit.wikimedia.org/r/973752 (https://phabricator.wikimedia.org/T327938) [12:05:57] (03CR) 10CI reject: [V: 04-1] Add netboot config for new private vlans in codfw rows A/B [puppet] - 10https://gerrit.wikimedia.org/r/973752 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [12:06:07] (03PS1) 10Btullis: Repool the cloudb10[13-16] hosts following maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973343 [12:06:15] (03CR) 10Nikerabbit: [C: 03+1] testwiki: Enable the Unified Content Translation Dashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973170 (https://phabricator.wikimedia.org/T337915) (owner: 10KartikMistry) [12:07:28] (03PS2) 10Btullis: Repool the cloudb10[13-16] hosts following maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973343 (https://phabricator.wikimedia.org/T344590) [12:07:38] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973343 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [12:08:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P53320 and previous config saved to /var/cache/conftool/dbconfig/20231113-120803-arnaudb.json [12:08:11] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host kafka-jumbo1007.eqiad.wmnet [12:08:20] (03CR) 10Jbond: [C: 04-1] "lgtm -1 is just on the title, feel free to reject the others" [puppet] - 10https://gerrit.wikimedia.org/r/973740 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [12:09:54] (03PS1) 10Muehlenhoff: Switch kafka-jumbo1007 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973753 (https://phabricator.wikimedia.org/T349619) [12:10:01] (03PS3) 10Cathal Mooney: Add netboot config for new private vlans in codfw rows A/B [puppet] - 10https://gerrit.wikimedia.org/r/973752 (https://phabricator.wikimedia.org/T327938) [12:10:25] (03CR) 10Jbond: [C: 03+1] [apt-staging] Fix 
apt_staging.yaml, add envoy config and pki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973323 (owner: 10EoghanGaffney) [12:10:32] (03CR) 10Ayounsi: [C: 03+1] Add netboot config for new private vlans in codfw rows A/B [puppet] - 10https://gerrit.wikimedia.org/r/973752 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [12:10:56] 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10MatthewVernon) @thcipriani you're listed as the approver for the `restricted` group. Can you approve (or otherwi... [12:11:16] (03CR) 10Btullis: [C: 03+2] Update the datahub images to address CVE-2023-4911 [deployment-charts] - 10https://gerrit.wikimedia.org/r/973142 (https://phabricator.wikimedia.org/T348647) (owner: 10Btullis) [12:12:23] (03Merged) 10jenkins-bot: Update the datahub images to address CVE-2023-4911 [deployment-charts] - 10https://gerrit.wikimedia.org/r/973142 (https://phabricator.wikimedia.org/T348647) (owner: 10Btullis) [12:12:26] (03CR) 10Majavah: [C: 03+1] Repool the cloudb10[13-16] hosts following maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973343 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [12:12:43] (03CR) 10Muehlenhoff: [C: 03+2] Switch kafka-jumbo1007 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973753 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:13:14] (03PS3) 10Btullis: Repool the cloudb10[13-16] hosts following maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973343 (https://phabricator.wikimedia.org/T344590) [12:13:20] (03CR) 10Cathal Mooney: [C: 03+2] Add netboot config for new private vlans in codfw rows A/B [puppet] - 10https://gerrit.wikimedia.org/r/973752 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [12:13:23] (03CR) 10Btullis: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/973343 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [12:14:06] (03Abandoned) 10Slyngshede: P:idp:services add Catalyst OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/973287 (https://phabricator.wikimedia.org/T350725) (owner: 10Slyngshede) [12:15:04] (03CR) 10Btullis: [C: 03+1] "Looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/973716 (https://phabricator.wikimedia.org/T350945) (owner: 10Brouberol) [12:15:58] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1004.eqiad.wmnet with OS bullseye [12:16:04] 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ayounsi@cumin1001 for host sretest1004.eqiad.wmnet with OS bullseye [12:16:13] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:16:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host kafka-jumbo1007.eqiad.wmnet [12:17:08] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:41] (03CR) 10Btullis: [C: 03+2] Repool the cloudb10[13-16] hosts following maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973343 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [12:18:35] (03CR) 10Hnowlan: service, conftool: add mw-jobrunner config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972442 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:18:37] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) I decided we should move about 10% of the mobileapps traffic at a time; that means about 300 rps, which I think we should be able to serve moving over about 2-3 api serve... 
[12:19:15] (03PS3) 10Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) [12:19:44] !log cmooney@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest2003.codfw.wmnet [12:21:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host centrallog2002.codfw.wmnet [12:21:24] PROBLEM - Check systemd state on failoid1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:58] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:22:36] (03PS1) 10Hnowlan: trafficserver: return traffic to editor-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/973758 (https://phabricator.wikimedia.org/T350747) [12:23:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T348183)', diff saved to https://phabricator.wikimedia.org/P53321 and previous config saved to /var/cache/conftool/dbconfig/20231113-122310-arnaudb.json [12:23:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:23:16] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 31.86 ms [12:23:26] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:23:26] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [12:23:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T348183)', diff saved to https://phabricator.wikimedia.org/P53322 and previous config saved to /var/cache/conftool/dbconfig/20231113-122332-arnaudb.json [12:24:12] (03PS1) 10Majavah: Add wiki replica backends to conftool [puppet] - 
10https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427) [12:24:14] (03PS1) 10Majavah: WIP: add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 [12:24:37] (03PS1) 10Kamila Součková: kube-state-metrics: reduce number of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/973762 (https://phabricator.wikimedia.org/T264625) [12:24:47] 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10MatthewVernon) [12:24:58] 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10MatthewVernon) ssh pubkey confirmed out-of-band. [12:25:03] (03CR) 10CI reject: [V: 04-1] WIP: add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (owner: 10Majavah) [12:25:14] RECOVERY - Check systemd state on ganeti1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:24] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:26:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T348183)', diff saved to https://phabricator.wikimedia.org/P53323 and previous config saved to /var/cache/conftool/dbconfig/20231113-122627-arnaudb.json [12:26:56] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (owner: 10Majavah) [12:27:16] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 31.85 ms [12:27:27] (03PS2) 10Majavah: WIP: add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 [12:27:29] (03PS1) 
10Muehlenhoff: Switch centrallog2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973763 (https://phabricator.wikimedia.org/T349619) [12:27:46] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [12:28:02] RECOVERY - Check systemd state on failoid1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:15] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [12:28:34] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1061 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:28:34] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [12:28:39] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1004.eqiad.wmnet with OS bullseye [12:28:44] 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ayounsi@cumin1001 for host sretest1004.eqiad.wmnet with OS bullseye executed with errors: - sretest1004 (**FAIL**) - Removed from Puppet and Puppet... [12:29:01] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [12:29:50] 10SRE, 10SRE-Access-Requests: Requesting access to WMF for Grace - https://phabricator.wikimedia.org/T350918 (10MatthewVernon) The `ecarg` account was added to the WMF ldap group in T350818 - is there any further access required? 
[12:30:31] (03PS3) 10Majavah: WIP: add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 [12:30:33] (03CR) 10Kamila Součková: "Based on https://phabricator.wikimedia.org/T264625#9324445" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973762 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:31:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest2003.codfw.wmnet [12:32:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (owner: 10Majavah) [12:32:23] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [12:32:36] PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:49] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bullseye [12:32:56] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye [12:33:18] (03CR) 10Brouberol: [C: 03+2] Re-generate the skein certificates during business days [puppet] - 10https://gerrit.wikimedia.org/r/973716 (https://phabricator.wikimedia.org/T350945) (owner: 10Brouberol) [12:34:14] !log restarting memcached on mc2038 [12:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:26] (03CR) 10Muehlenhoff: [C: 03+2] Switch centrallog2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973763 
(https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:35:25] (03PS4) 10Majavah: WIP: add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 [12:35:34] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [12:36:55] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:38:23] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 31.86 ms [12:38:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host centrallog2002.codfw.wmnet [12:41:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P53324 and previous config saved to /var/cache/conftool/dbconfig/20231113-124133-arnaudb.json [12:42:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:42:36] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bullseye [12:42:42] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye... 
[12:43:10] (03PS1) 10Btullis: Depool clouddb10[17-20] for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973768 (https://phabricator.wikimedia.org/T344590) [12:43:16] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bullseye [12:43:25] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye [12:44:42] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973768 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [12:45:05] (03CR) 10Ladsgroup: use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [12:46:53] (03PS1) 10Majavah: cr-labs: permit cloudlb to wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/973769 (https://phabricator.wikimedia.org/T300427) [12:49:09] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:27] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:53:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bookworm [12:55:35] !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet [12:56:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P53325 and previous 
config saved to /var/cache/conftool/dbconfig/20231113-125640-arnaudb.json [12:56:49] (03CR) 10Btullis: "The pcc run is interesting for this." [puppet] - 10https://gerrit.wikimedia.org/r/973768 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [13:03:26] (03CR) 10Majavah: [C: 03+1] Depool clouddb10[17-20] for maintenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973768 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [13:04:56] (03CR) 10JMeybohm: [C: 03+1] "LGTM if you want the canary to use mwapi-async at first (like main does)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973179 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [13:05:27] (03PS1) 10Ayounsi: Add e8/f8 subnets to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/973772 (https://phabricator.wikimedia.org/T335028) [13:06:08] (03CR) 10Ayounsi: [C: 03+2] Add e8/f8 subnets to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/973772 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [13:09:10] (03CR) 10Btullis: [C: 03+2] Depool clouddb10[17-20] for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973768 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [13:10:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kafka::jumbo::broker [13:11:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T348183)', diff saved to https://phabricator.wikimedia.org/P53326 and previous config saved to /var/cache/conftool/dbconfig/20231113-131146-arnaudb.json [13:11:48] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [13:11:51] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:12:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet 
with reason: Maintenance [13:12:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1210 (T348183)', diff saved to https://phabricator.wikimedia.org/P53327 and previous config saved to /var/cache/conftool/dbconfig/20231113-131207-arnaudb.json [13:12:18] (03PS1) 10Muehlenhoff: Switch kafka-jumbo to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973773 (https://phabricator.wikimedia.org/T349619) [13:15:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T348183)', diff saved to https://phabricator.wikimedia.org/P53328 and previous config saved to /var/cache/conftool/dbconfig/20231113-131556-arnaudb.json [13:17:49] (03CR) 10Muehlenhoff: [C: 03+2] Switch kafka-jumbo to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973773 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:18:23] (03CR) 10Hnowlan: [C: 03+1] envoy: Allow additional arguments to envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/973721 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:20:40] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy: Allow additional arguments to envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/973721 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:24:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kafka::jumbo::broker [13:25:00] (PuppetZeroResources) resolved: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:26:15] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [13:27:31] (03PS1) 10Hashar: (DO NOT SUBMIT) testing for CI [puppet] - 10https://gerrit.wikimedia.org/r/973775 
[13:27:49] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973775 (owner: 10Hashar) [13:28:46] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [13:29:02] (03PS1) 10Hnowlan: api-gateway, rest-gateway: drop envoy-future, use latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130) [13:29:43] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973775 (owner: 10Hashar) [13:30:27] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P53329 and previous config saved to /var/cache/conftool/dbconfig/20231113-133102-arnaudb.json [13:31:38] 10sre-alert-triage, 10Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (10LSobanski) [13:32:31] 10sre-alert-triage, 10Traffic: Alert in need of triage: PuppetConstantChange (instance pybal-test2003:9100) - https://phabricator.wikimedia.org/T351084 (10LSobanski) [13:36:11] (03PS3) 10BBlack: varnish: remove TCP monitoring [puppet] - 10https://gerrit.wikimedia.org/r/957349 (https://phabricator.wikimedia.org/T333965) [13:36:13] (03PS3) 10BBlack: varnish: only listen on a single, local TCP port [puppet] - 10https://gerrit.wikimedia.org/r/957348 (https://phabricator.wikimedia.org/T333965) [13:36:54] (03Abandoned) 10BBlack: varnish: limit TCP listener to localhost [puppet] - 10https://gerrit.wikimedia.org/r/957350 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:38:30] !log installing tomcat9 security updates [13:38:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:52] (03CR) 10BBlack: "PCC on cp1075 looks sane to me:" [puppet] - 10https://gerrit.wikimedia.org/r/957348 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:41:02] (03Abandoned) 10Clément Goubert: team-sre: Add warning for CentralAuth job lag [alerts] - 10https://gerrit.wikimedia.org/r/935078 (https://phabricator.wikimedia.org/T336627) (owner: 10Clément Goubert) [13:42:20] !log installing nghttp2 security updates [13:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:31] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Fix apt_staging.yaml, add envoy config and pki [puppet] - 10https://gerrit.wikimedia.org/r/973323 (owner: 10EoghanGaffney) [13:42:50] (03PS5) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) [13:42:54] (03PS1) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) [13:43:59] (03PS1) 10EoghanGaffney: [apt_repo] Ensure that parent directories of basedir exist [puppet] - 10https://gerrit.wikimedia.org/r/973778 [13:44:40] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/957348 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:45:14] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [13:45:16] !log restarting FPM/Apache on mw canaries [13:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:00] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: requesttracker [13:46:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1210', diff saved to https://phabricator.wikimedia.org/P53330 and previous config saved to /var/cache/conftool/dbconfig/20231113-134608-arnaudb.json [13:47:43] (03PS1) 10Muehlenhoff: Switch requesttracker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973779 (https://phabricator.wikimedia.org/T349619) [13:48:23] (03PS6) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) [13:48:45] jouncebot: next [13:48:45] In 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T1400) [13:48:58] (03CR) 10Fabfur: [C: 03+1] varnish: remove TCP monitoring [puppet] - 10https://gerrit.wikimedia.org/r/957349 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:50:08] (03CR) 10JMeybohm: [C: 03+1] kube-state-metrics: reduce number of metrics (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973762 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [13:50:43] (03CR) 10Muehlenhoff: [C: 03+2] Switch requesttracker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973779 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:51:05] (03PS7) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) [13:52:21] (03CR) 10BBlack: [C: 03+2] varnish: remove TCP monitoring [puppet] - 10https://gerrit.wikimedia.org/r/957349 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:52:28] (03CR) 10BBlack: [C: 03+2] varnish: only listen on a single, local TCP port [puppet] - 10https://gerrit.wikimedia.org/r/957348 (https://phabricator.wikimedia.org/T333965) (owner: 10BBlack) [13:52:45] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 
10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [13:53:48] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bullseye [13:53:54] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye... [13:54:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: requesttracker [13:57:57] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T1400). [14:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] i can deploy today [14:00:19] If possible I have one to add late. 
[14:00:26] hi [14:00:26] Dreamy_Jazz: add it to the cal [14:00:29] hey MatmaRex [14:00:34] i might have to leave early in a couple of minutes [14:00:46] (03CR) 10Urbanecm: [C: 03+2] ParserOutputAccess: Limit local cache size [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973339 (https://phabricator.wikimedia.org/T315510) (owner: 10Bartosz Dziewoński) [14:00:48] Was adding it to the list when I got distracted by other work :) [14:01:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T348183)', diff saved to https://phabricator.wikimedia.org/P53331 and previous config saved to /var/cache/conftool/dbconfig/20231113-140115-arnaudb.json [14:01:17] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [14:01:20] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:01:27] urbanecm: should i reschedule, or would you be okay starting those scripts without me? [14:01:30] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [14:01:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53332 and previous config saved to /var/cache/conftool/dbconfig/20231113-140136-arnaudb.json [14:01:39] (03PS1) 10Slyngshede: P:url_downloader add blackbox exporter. [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) [14:01:41] MatmaRex: assuming there's no other way to test this aside from running the DT script, i actually don't need you to be around for the deployment of this. [14:02:35] (03CR) 10CI reject: [V: 04-1] P:url_downloader add blackbox exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:03:11] Dreamy_Jazz: waiting for your patch(es) to be added, while CI does its magic :) [14:03:22] It's a request for a maintenance script run [14:03:23] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [14:03:30] Dreamy_Jazz: ah, which one? [14:03:35] createExtensionTables.json [14:03:36] and do you have a task for this? [14:03:39] Yes [14:03:53] link please :) [14:03:54] https://phabricator.wikimedia.org/T350321 [14:03:58] ty [14:04:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53333 and previous config saved to /var/cache/conftool/dbconfig/20231113-140427-arnaudb.json [14:05:11] urbanecm: okay. thank you [14:05:27] i need to step out for a moment then. i'll be back in maybe 20 minutes :) [14:05:29] thanks again [14:05:56] ack [14:05:57] (03PS5) 10Filippo Giunchedi: prometheus-puppet-agent-stats: this timer sometime fails [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [14:06:18] (03PS1) 10Majavah: cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) [14:06:20] (03PS1) 10Majavah: P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) [14:06:37] Added to the calendar the command [14:06:45] Hopefully that is the correct syntax [14:06:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host clouddb1017.eqiad.wmnet [14:07:07] Dreamy_Jazz: well yeah, but i need to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/972384/ first [14:07:17] Oh. 
[14:07:28] (03PS1) 10Urbanecm: Add MediaModeration to createExtensionTables.php [extensions/WikimediaMaintenance] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973345 (https://phabricator.wikimedia.org/T350321) [14:07:33] (03CR) 10Urbanecm: [C: 03+2] Add MediaModeration to createExtensionTables.php [extensions/WikimediaMaintenance] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973345 (https://phabricator.wikimedia.org/T350321) (owner: 10Urbanecm) [14:07:39] Thanks. [14:07:43] (03PS1) 10Urbanecm: Add MediaModeration to addWiki.php [extensions/WikimediaMaintenance] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973786 (https://phabricator.wikimedia.org/T350321) [14:07:47] (03CR) 10Urbanecm: [C: 03+2] Add MediaModeration to addWiki.php [extensions/WikimediaMaintenance] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973786 (https://phabricator.wikimedia.org/T350321) (owner: 10Urbanecm) [14:07:56] and the addwiki one just in case someone creates a wiki this week [14:08:02] so that you don't get drifts [14:08:26] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/427/con" [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [14:08:39] Sure. [14:08:54] (03CR) 10Filippo Giunchedi: "I've tested this in Pontoon and the latest PS works as intended for me." [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [14:09:25] Dreamy_Jazz: the table's private, right? [14:09:33] (03CR) 10CI reject: [V: 04-1] cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [14:09:56] Yes. As it's on extension1 it should be fine to create now (according to Ladsgroup). 
[14:10:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10VRiley-WMF) 05Open→03Resolved [14:10:33] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [14:10:39] As I was told no wikidumps are taken from extension1 at the moment. [14:10:50] (03PS1) 10DDesouza: Deploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973784 (https://phabricator.wikimedia.org/T345951) [14:10:55] (03PS1) 10Majavah: hieradata: migrate codfw cloudlb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087) [14:10:57] (03PS1) 10Majavah: hieradata: migrate all cloudlb hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087) [14:11:05] Dreamy_Jazz: that's true, but AFAIK the table is still expected to be listed in `$private_tables` in puppet, in case that changes. might be wrong about that though. [14:12:01] Sent the message that Ladsgroup sent me on slack to you on slack. [14:13:42] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/429/con" [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [14:13:59] Dreamy_Jazz: thanks. that clarifies, creating. [14:14:05] well, once it merges. 
[14:14:12] :) [14:14:45] (03PS2) 10Majavah: Add wiki replica backends to conftool [puppet] - 10https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427) [14:14:47] (03PS2) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) [14:14:49] (03PS8) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) [14:15:30] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:973786|Add MediaModeration to addWiki.php (T350321)]], [[gerrit:973345|Add MediaModeration to createExtensionTables.php (T350321)]] [14:15:37] T350321: [M] Create database table to store status of scans - https://phabricator.wikimedia.org/T350321 [14:16:15] Dreamy_Jazz: oh congrats, hadn't seen that you started working at wmf! [14:16:24] Thanks! [14:16:33] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/430/con" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [14:16:43] Yeah, I'm working with the Trust and Safety Product team as a contractor. [14:16:48] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:973786|Add MediaModeration to addWiki.php (T350321)]], [[gerrit:973345|Add MediaModeration to createExtensionTables.php (T350321)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:16:58] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:17:06] 10SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (10Ladsgroup) >>! In T350924#9323184, @MatthewVernon wrote: > I don't want to tie anyone up in red tape, but I think it'd be good to have a lightweight process to ensure this doesn't just become a dustbin... 
[14:17:51] 10SRE, 10SRE-Access-Requests: Requesting access to WMF for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10Aklapper) [14:18:24] (03Merged) 10jenkins-bot: ParserOutputAccess: Limit local cache size [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973339 (https://phabricator.wikimedia.org/T315510) (owner: 10Bartosz Dziewoński) [14:18:32] If possible I'd like to add a patch for this window. If not I can add it to the next window. [14:19:09] danisztls: we'll see about that. can you add it to the calendar? [14:19:15] (03CR) 10CI reject: [V: 04-1] Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [14:19:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P53334 and previous config saved to /var/cache/conftool/dbconfig/20231113-141934-arnaudb.json [14:19:56] (03CR) 10Filippo Giunchedi: "Thank you for the quick review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/973740 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [14:20:20] (03PS2) 10Filippo Giunchedi: oauth2_proxy: new module [puppet] - 10https://gerrit.wikimedia.org/r/973740 (https://phabricator.wikimedia.org/T331512) [14:20:22] (03PS2) 10Filippo Giunchedi: thanos: add oidc support via oauth2-proxy [puppet] - 10https://gerrit.wikimedia.org/r/973741 (https://phabricator.wikimedia.org/T331512) [14:20:48] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/431/console" [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:21:37] urbanecm: added [14:22:29] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:973786|Add MediaModeration to addWiki.php (T350321)]], [[gerrit:973345|Add MediaModeration to createExtensionTables.php (T350321)]] (duration: 06m 58s) [14:22:33] T350321: [M] Create database table to store status of scans - https://phabricator.wikimedia.org/T350321 [14:23:11] (03PS2) 10Hashar: (DO NOT SUBMIT) testing for CI (PS2) [puppet] - 10https://gerrit.wikimedia.org/r/973775 [14:24:05] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:973339|ParserOutputAccess: Limit local cache size (T315510)]] [14:24:08] (03CR) 10Kamila Součková: [C: 03+2] kube-state-metrics: reduce number of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/973762 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [14:24:09] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:25:05] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [14:25:18] (03PS2) 10Slyngshede: P:url_downloader add blackbox exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) [14:25:23] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [14:25:33] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [14:25:47] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [14:26:55] (03Merged) 10jenkins-bot: kube-state-metrics: reduce number of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/973762 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [14:26:57] (03PS2) 10Majavah: cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) [14:26:59] (03PS2) 10Majavah: P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) [14:27:01] PROBLEM - MariaDB Replica SQL: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:27:01] (03PS2) 10Majavah: hieradata: migrate codfw cloudlb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087) [14:27:03] (03PS2) 10Majavah: hieradata: migrate all cloudlb hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087) [14:27:05] (03PS1) 10DDesouza: Undeploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973808 (https://phabricator.wikimedia.org/T349854) [14:27:13] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:27:23] PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:27:32] (03PS1) 10Urbanecm: Add mediamoderation_scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973788 (https://phabricator.wikimedia.org/T350321) [14:27:36] (03CR) 10Urbanecm: [C: 03+2] Add mediamoderation_scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973788 (https://phabricator.wikimedia.org/T350321) (owner: 10Urbanecm) [14:27:45] (03CR) 10CI reject: [V: 04-1] cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [14:27:51] Dreamy_Jazz: and one more backport [14:27:55] Thanks [14:28:03] PROBLEM - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service,wmf-pt-kill@s3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:12] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:28:24] (03CR) 10Urbanecm: [C: 03+2] Deploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973784 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [14:28:27] RECOVERY - MariaDB Replica SQL: s1 on clouddb1017 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:28:51] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:29:12] (03Merged) 10jenkins-bot: Deploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973784 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [14:29:25] (03PS3) 10Majavah: cloudlb: haproxy: migrate to 
firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) [14:29:27] (03PS3) 10Majavah: P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) [14:29:29] (03PS3) 10Majavah: hieradata: migrate codfw cloudlb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087) [14:29:31] (03PS3) 10Majavah: hieradata: migrate all cloudlb hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087) [14:30:04] (03Merged) 10jenkins-bot: Add mediamoderation_scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973788 (https://phabricator.wikimedia.org/T350321) (owner: 10Urbanecm) [14:30:45] !log installing debianutils bugfix updates from Bookworm point release [14:30:47] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:973339|ParserOutputAccess: Limit local cache size (T315510)]] (duration: 06m 42s) [14:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:58] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:31:11] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:973784|Deploy Reader Demographics 2 survey (T345951)]], [[gerrit:973788|Add mediamoderation_scan table (T350321)]] [14:31:15] (03CR) 10Fabfur: [C: 03+1] trafficserver: return traffic to editor-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/973758 (https://phabricator.wikimedia.org/T350747) (owner: 10Hnowlan) [14:31:20] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [14:31:20] T350321: [M] Create database table to store status of scans - https://phabricator.wikimedia.org/T350321 [14:32:05] (03PS1) 10Dreamy Jazz: Add mediamoderation_scan to 
$private_tables [puppet] - 10https://gerrit.wikimedia.org/r/973809 (https://phabricator.wikimedia.org/T350321) [14:32:18] (03CR) 10Jbond: [C: 04-1] [apt_repo] Ensure that parent directories of basedir exist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973778 (owner: 10EoghanGaffney) [14:32:28] !log urbanecm@deploy2002 urbanecm and dani: Backport for [[gerrit:973784|Deploy Reader Demographics 2 survey (T345951)]], [[gerrit:973788|Add mediamoderation_scan table (T350321)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:32:41] (03PS5) 10ArielGlenn: use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) [14:32:57] (03CR) 10CI reject: [V: 04-1] cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [14:33:01] (03PS4) 10Majavah: cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) [14:33:03] (03PS4) 10Majavah: P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) [14:33:05] (03PS4) 10Majavah: hieradata: migrate codfw cloudlb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087) [14:33:07] (03PS4) 10Majavah: hieradata: migrate all cloudlb hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087) [14:33:31] (03CR) 10ArielGlenn: use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [14:33:43] (03PS2) 10Dreamy Jazz: Add mediamoderation_scan to $private_tables [puppet] - 
10https://gerrit.wikimedia.org/r/973809 (https://phabricator.wikimedia.org/T350321) [14:33:50] danisztls: please test your patch at mwdebug2001 now. [14:34:10] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [14:34:17] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host clouddb1017.eqiad.wmnet [14:34:27] (03CR) 10Hnowlan: [C: 03+2] trafficserver: return traffic to editor-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/973758 (https://phabricator.wikimedia.org/T350747) (owner: 10Hnowlan) [14:34:31] (03CR) 10Urbanecm: [C: 03+1] Add mediamoderation_scan to $private_tables [puppet] - 10https://gerrit.wikimedia.org/r/973809 (https://phabricator.wikimedia.org/T350321) (owner: 10Dreamy Jazz) [14:34:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P53335 and previous config saved to /var/cache/conftool/dbconfig/20231113-143440-arnaudb.json [14:34:52] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/973740 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [14:34:58] urbanecm: looks good [14:35:02] !log urbanecm@deploy2002 urbanecm and dani: Continuing with sync [14:35:06] ty, proceeding [14:35:30] PROBLEM - Check systemd state on apt-staging2001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:34] RECOVERY - MariaDB Replica Lag: s3 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:35:34] RECOVERY - Check systemd state on clouddb1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:34] RECOVERY - 
MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:35:40] (03CR) 10Jbond: [C: 03+2] "tested with sretest1002" [cookbooks] - 10https://gerrit.wikimedia.org/r/973315 (owner: 10Jbond) [14:36:18] (03PS5) 10Majavah: cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) [14:36:20] (03PS5) 10Majavah: P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) [14:36:22] (03PS5) 10Majavah: hieradata: migrate codfw cloudlb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087) [14:36:24] (03PS5) 10Majavah: hieradata: migrate all cloudlb hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087) [14:36:26] (03CR) 10CI reject: [V: 04-1] cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [14:36:26] urbanecm: thanks! [14:37:19] (i'm back) [14:37:52] MatmaRex: and i just started your scripts. with updated --start. [14:38:11] oh, neat. thank you [14:38:12] can you please help me with monitoring the memory stuff doesn't happen anymore? 
[14:38:29] !log mwmaint2002: Start several instances of `extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php` (T315510) [14:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:34] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:38:48] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [14:38:54] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:20] yeah. i'll look at that chart from time to time [14:40:02] huh, what freed 30 GB of memory this morning? https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mwmaint2002&var-datasource=thanos&var-cluster=misc&viewPanel=4&from=now-2d&to=now [14:40:25] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:973784|Deploy Reader Demographics 2 survey (T345951)]], [[gerrit:973788|Add mediamoderation_scan table (T350321)]] (duration: 09m 13s) [14:40:30] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [14:40:31] T350321: [M] Create database table to store status of scans - https://phabricator.wikimedia.org/T350321 [14:40:41] (03Merged) 10jenkins-bot: sre.hosts.reimage: reimage with current puppet version unless new [cookbooks] - 10https://gerrit.wikimedia.org/r/973315 (owner: 10Jbond) [14:40:59] danisztls: deployed [14:41:30] !log cp2027: varnish-frontend-restart to test tcp listen port changes [14:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:08] Dreamy_Jazz: assuming this looks good to you, i can create everywhere (10.192.16.34=db2096=x1) 
https://www.irccloud.com/pastebin/rmGh9Q2a/ [14:42:34] MatmaRex: good question. i don't know. [14:42:42] As long as that's on extension1, then yes it looks good. [14:42:49] 👍 [14:43:36] (I missed the comment about it being on x1 in your first comment based on the DB name). [14:43:46] (and now see that). [14:43:54] !log mwmaint2002: foreachwiki extensions/WikimediaMaintenance/createExtensionTables.php MediaModeration (T350321) [14:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:01] no worries [14:45:30] Dreamy_Jazz: would you mind rewriting the createExtensionTables.php in a way that doesn't fatal when re-executed? see https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/createExtensionTables.php#L103 for how that can be done. [14:46:14] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host clouddb1018.eqiad.wmnet [14:46:19] What is the fatal? [14:46:46] `Table 'mediamoderation_scan' already exists` [14:46:51] I copied the code from the growthexperiments case statement, but just modified the specific names + how the DB is acquired. [14:46:53] this in full https://www.irccloud.com/pastebin/AnZKP3Fk/ [14:47:30] Dreamy_Jazz: growthexperiments has per-table SQL files included there, but mediamoderation uses tables-generated, so it runs all the create table statements [14:47:40] I see [14:47:44] so new table creations would have to be done through loading SQL directly, not via the script [14:48:00] anyway, tables are live :) [14:48:14] Thanks! [14:48:20] I'll update the script now. [14:48:30] ty [14:49:04] by the way, the mwmaint2002 memory usage seems to be growing :/ so that might still be a problem with the maintenance scripts, unless we have very generous garbage collection config for PHP maintenance scripts or something [14:49:07] Dreamy_Jazz: i also see beta doesn't have the table. 
if that's not expected, please implement `LoadExtensionSchemaUpdatesHook` somewhere in your extension [14:49:18] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [14:49:21] It is implemented. [14:49:33] But probably not running because the virtual domains config is set for beta wikis I guess [14:49:41] MatmaRex: yep, frwiki's now at 8GiB... i assume that's not okay. [14:49:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53336 and previous config saved to /var/cache/conftool/dbconfig/20231113-144947-arnaudb.json [14:49:49] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [14:49:52] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:50:13] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [14:50:30] MatmaRex: enwiki / rowiki are behaving reasonably so far though. [14:50:40] (03PS3) 10Majavah: Add wiki replica backends to conftool [puppet] - 10https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427) [14:50:41] huh, interesting [14:50:42] (03PS3) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) [14:50:43] but frwiki fails right at the first revision, 7544396 [14:50:44] (03PS9) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) [14:50:51] so it seems to be data-specific [14:51:03] i'm stopping frwiki, this will fail again. 
[14:51:32] !log mwmaint2002: stop `extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki frwiki` again, memory leak didn't stop (T315510) [14:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:36] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:52:02] thanks [14:52:04] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [14:52:17] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [14:52:20] I'll make a config patch to remove the config for betawikis and use the main DB, based on the URL shortener extension not using extension1 for betawikis. Will need the puppet change merged before that config change can be made, so will create the config change and leave for a later window. [14:52:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1230 (T348183)', diff saved to https://phabricator.wikimedia.org/P53337 and previous config saved to /var/cache/conftool/dbconfig/20231113-145223-arnaudb.json [14:52:29] (this is so annoying, why in the world would it break now… eh) [14:52:31] (03PS11) 10Hashar: Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 [14:52:52] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/434/con" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [14:52:59] MatmaRex: i feel you. is there any thing i can do to help you debugging this? [14:53:51] (03CR) 10FNegri: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [14:53:54] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:03] Dreamy_Jazz: ahh, makes sense. well, once the config is correct on beta (no `wgVirtualDomainsMapping` for your ext on beta), the beta equivalent would get created automatically. [14:54:11] 👍 [14:54:23] urbanecm: not at this time. i will need to learn what is even possible to do in PHP. maybe there's some way to log what objects are taking up all that space, or something [14:55:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: introduce canary release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973179 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [14:55:09] MatmaRex: okay, ack. feel free to ping me if needed. [14:55:22] <_joe_> jouncebot: now [14:55:22] For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T1400) [14:55:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T348183)', diff saved to https://phabricator.wikimedia.org/P53338 and previous config saved to /var/cache/conftool/dbconfig/20231113-145524-arnaudb.json [14:55:30] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:55:31] so it is just frwiki, the other two are not growing? that's the weirdest thing about this [14:55:36] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:55:44] anyway. 
not now :) thanks for deploying urbanecm [14:55:50] np :) [14:56:09] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:56:16] (03Merged) 10jenkins-bot: mobileapps: introduce canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/973179 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [14:56:23] MatmaRex: at this time, yes. i _think_ that frwiki is just at a problematic revision right now, while the other two wikis didn't encounter the issue so far. [14:56:49] !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:57:07] Thanks for the deploys and backports! [14:57:09] MatmaRex: enwiki is at 880M, rowiki 212M ATM. [14:57:12] Dreamy_Jazz: any time :) [14:57:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1018.eqiad.wmnet [14:58:53] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:58:58] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:59:10] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:59:56] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:00:00] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host clouddb1019.eqiad.wmnet [15:00:28] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:00:59] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:02:16] 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [15:02:34] 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap), 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [15:02:50] 10Puppet, 10MediaModeration (MediaModeration 2.0): Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) [15:03:12] 10Puppet, 10MediaModeration (MediaModeration 2.0): Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) a:03Dreamy_Jazz [15:03:23] 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint: Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) [15:03:31] 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint: [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) [15:03:38] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [15:03:41] 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) [15:03:53] (03PS3) 10Dreamy Jazz: Add mediamoderation_scan to $private_tables [puppet] - 10https://gerrit.wikimedia.org/r/973809 (https://phabricator.wikimedia.org/T350321) [15:04:28] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1102.eqiad.wmnet [15:04:28] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime 
(exit_code=0) for cp1102.eqiad.wmnet [15:05:55] (03PS1) 10DDesouza: Fix Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973814 (https://phabricator.wikimedia.org/T345951) [15:06:20] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:06:38] (03CR) 10Clément Goubert: [C: 03+1] mobileapps: add egress networkpolicy for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/973180 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:07:11] !log swapped cp1102 <-> cp1077 (T349244) [15:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1019.eqiad.wmnet [15:07:27] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [15:08:22] (03CR) 10Slyngshede: "Let me know if this isn't the right direction to go in." 
[puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:08:32] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host clouddb1020.eqiad.wmnet [15:08:32] 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) [15:09:04] 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) [15:10:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P53339 and previous config saved to /var/cache/conftool/dbconfig/20231113-151031-arnaudb.json [15:11:12] (03PS2) 10EoghanGaffney: [apt_repo] Ensure that parent directories of basedir exist [puppet] - 10https://gerrit.wikimedia.org/r/973778 [15:11:48] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [15:11:51] (03PS2) 10Giuseppe Lavagetto: mobileapps: add egress networkpolicy for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/973180 (https://phabricator.wikimedia.org/T350846) [15:12:33] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/439/con" [puppet] - 10https://gerrit.wikimedia.org/r/973778 (owner: 10EoghanGaffney) [15:12:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [15:13:12] RECOVERY - haproxy failover on dbproxy1019 is OK: OK 
check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:13:40] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1103.eqiad.wmnet [15:13:40] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1103.eqiad.wmnet [15:14:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1020.eqiad.wmnet [15:14:56] !log swapped cp1103 <-> cp1078 (T349244) [15:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:00] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [15:16:01] (03CR) 10Clément Goubert: [C: 03+1] mobileapps: add egress networkpolicy for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/973180 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:16:11] (03CR) 10Clément Goubert: [C: 03+1] mobileapps: switch canary to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973181 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:17:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [15:17:41] (03CR) 10Clément Goubert: [C: 03+1] mobileapps: move traffic to mw on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973182 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:19:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [15:21:52] (03PS1) 10Dreamy Jazz: Use local DB when on betawikis for 'virtual-mediamoderation' domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973816 (https://phabricator.wikimedia.org/T351096) [15:22:29] (03PS2) 10Dreamy Jazz: Use local DB when on betawikis for 'virtual-mediamoderation' domain [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/973816 (https://phabricator.wikimedia.org/T351096) [15:22:31] (03CR) 10Clément Goubert: [C: 03+1] "LGTM, but it should be deployed before the previous patch (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/973182/) moving" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973183 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:23:59] (PuppetFailure) firing: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:25:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P53340 and previous config saved to /var/cache/conftool/dbconfig/20231113-152537-arnaudb.json [15:28:58] (03CR) 10Clément Goubert: [C: 03+1] "IMO this patch should come with another capacity raise for mw-api-int, but we can also load it more than it currently is and make a decisi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973184 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:29:39] (03CR) 10Brouberol: [V: 03+1] "I have now automated the generation of the subnet config files as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [15:30:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: add egress networkpolicy for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/973180 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:31:07] (03Merged) 10jenkins-bot: mobileapps: add egress networkpolicy for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/973180 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:31:15] (03CR) 10Muehlenhoff: [C: 03+1] "Group ownership request was approved in today's SRE Infrastructure Foundations meeting" [puppet] - 10https://gerrit.wikimedia.org/r/972909 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [15:31:23] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:31:41] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:31:51] (03CR) 10Majavah: Generate the netboot.cfg file to avoid typos impacting everyone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [15:32:53] (03PS54) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) [15:34:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/973778 (owner: 10EoghanGaffney) [15:36:03] (03CR) 10CI reject: [V: 04-1] Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [15:38:48] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:39:07] (03CR) 10Hnowlan: [C: 03+2] service, conftool: add mw-jobrunner config [puppet] - 10https://gerrit.wikimedia.org/r/972442 
(https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:39:58] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:40:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T348183)', diff saved to https://phabricator.wikimedia.org/P53341 and previous config saved to /var/cache/conftool/dbconfig/20231113-154044-arnaudb.json [15:40:46] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:40:48] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:40:53] (03PS1) 10Effie Mouzeli: tegola: update image to pick up OS fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/973817 (https://phabricator.wikimedia.org/T348647) [15:41:00] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:42:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:43:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:46:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [15:46:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issue - https://phabricator.wikimedia.org/T348272 (10jbond) p:05Triage→03Medium [15:46:35] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [15:46:35] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issue - 
https://phabricator.wikimedia.org/T348272 (10jbond) 05Open→03Invalid [15:46:39] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [15:46:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T348183)', diff saved to https://phabricator.wikimedia.org/P53342 and previous config saved to /var/cache/conftool/dbconfig/20231113-154641-arnaudb.json [15:46:45] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:48:12] 10Puppet, 10MobileFrontend (Tracking), 10User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (10jbond) [15:49:32] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10conftool: confd fails to start after a reimage - https://phabricator.wikimedia.org/T244477 (10jbond) I have a feeling this is fixed we should see if its still present [15:49:52] 10SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (10MatthewVernon) >>! In T350924#9326604, @Ladsgroup wrote: > Sounds good to me but access is always limited to DBAs (if you mean who DBAs can hand it over to, it gets complicated in some cases) I did me... 
[15:51:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T348183)', diff saved to https://phabricator.wikimedia.org/P53343 and previous config saved to /var/cache/conftool/dbconfig/20231113-155149-arnaudb.json [15:51:54] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:52:55] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:53:06] 10SRE-swift-storage: "Original file" seems to be missing at Commons for an image - https://phabricator.wikimedia.org/T72416 (10MatthewVernon) 05Open→03Invalid Yes, I think we might as well close this task. [15:53:54] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:54:19] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:54:49] (03CR) 10Ebernhardson: [C: 03+1] staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973242 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:55:30] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: puppetlabs: create puppet 7 environment in WMCS to test code - https://phabricator.wikimedia.org/T294841 (10jbond) 05In progress→03Resolved this is available in the puppet-dev project [15:55:37] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1031.eqiad.wmnet with OS bookworm [15:55:49] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10MoritzMuehlenhoff) [15:57:02] (03PS1) 10Hnowlan: kubernetes::worker: add mw-jobrunner to pools [puppet] - 10https://gerrit.wikimedia.org/r/973824 
(https://phabricator.wikimedia.org/T349796) [15:57:21] (03PS4) 10Ladsgroup: Add mediamoderation_scan to $private_tables [puppet] - 10https://gerrit.wikimedia.org/r/973809 (https://phabricator.wikimedia.org/T350321) (owner: 10Dreamy Jazz) [15:57:28] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add mediamoderation_scan to $private_tables [puppet] - 10https://gerrit.wikimedia.org/r/973809 (https://phabricator.wikimedia.org/T350321) (owner: 10Dreamy Jazz) [15:57:30] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:58] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099 (10joanna_borun) [15:59:11] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/444/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [16:01:05] (03PS1) 10Hnowlan: service: move mw-jobrunner to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/973825 (https://phabricator.wikimedia.org/T349796) [16:01:48] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Issues which should be fixed by puppet7 upgrade - https://phabricator.wikimedia.org/T351104 (10jbond) [16:01:57] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10joanna_borun) [16:02:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-Joe: Update puppet code to conform to puppet 4.x and later standards - https://phabricator.wikimedia.org/T181967 (10jbond) [16:02:30] (03CR) 10DCausse: "done in wikikube" [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse) [16:02:57] (03CR) 10DCausse: [C: 03+1] Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking) [16:03:04] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-Joe: Disable hiera autolookups - https://phabricator.wikimedia.org/T181971 (10jbond) 05Open→03Declined im going to close this as its [[ https://phabricator.wikimedia.org/T181971#5967526 | no longer possible ]] [16:03:32] (03CR) 10DCausse: [C: 03+1] "should this be rebased on top of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/972725 ?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973242 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [16:03:39] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10joanna_borun) a:05jbond→03cmooney [16:03:41] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099 (10jbond) [16:03:43] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Issues which should be fixed by puppet7 upgrade - https://phabricator.wikimedia.org/T351104 (10jbond) [16:05:37] (03PS2) 10Giuseppe Lavagetto: mobileapps: switch canary to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973181 (https://phabricator.wikimedia.org/T350846) [16:05:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [16:05:54] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Create backups for puppetservers - https://phabricator.wikimedia.org/T347390 (10jbond) 
05Open→03Resolved a:03jbond This is set up now [16:06:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): spicerack: update spicerack to work with the newer puppet infrastructure - https://phabricator.wikimedia.org/T341496 (10Volans) Update: for the production side of things this is completed. Leaving ope... [16:06:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P53344 and previous config saved to /var/cache/conftool/dbconfig/20231113-160656-arnaudb.json [16:07:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: switch canary to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973181 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [16:08:24] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure: reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10joanna_borun) [16:08:47] (03Merged) 10jenkins-bot: mobileapps: switch canary to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973181 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [16:09:25] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure: reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10jhathaway) a:03jhathaway [16:09:43] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:53] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage [16:10:29] (03Abandoned) 10DCausse: rdf-streaming-updater: add a "wcqs" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse) [16:11:11] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply 
[16:11:57] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:12:53] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage [16:13:45] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:13:58] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [16:14:00] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:14:25] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [16:14:56] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) 05In progress→03Resolved a:03jbond Theses issues are all resolved [16:15:09] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Create cookbook to migrate servers from the puppetmasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10jbond) 05Open→03Resolved a:03jbond this is complete [16:15:20] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [16:15:34] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Traffic, and 2 others: find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond) 05In progress→03Resolved a:03jbond This is in place now use hiera key during migration [16:16:13] (03CR) 10JMeybohm: [C: 03+1] api-gateway, rest-gateway: drop envoy-future, use latest version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan) [16:20:36] (03CR) 10Ladsgroup: [C: 03+1] use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [16:21:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:22:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P53345 and previous config saved to /var/cache/conftool/dbconfig/20231113-162202-arnaudb.json [16:23:26] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): puppet7: drop instances of :undef in erb files - https://phabricator.wikimedia.org/T341071 (10jhathaway) [16:23:51] (03PS1) 10Hnowlan: service: move mw-jobrunner to prod, enable paging [puppet] - 10https://gerrit.wikimedia.org/r/973827 (https://phabricator.wikimedia.org/T349796) [16:24:45] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) 05Open→03In progress [16:24:48] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [16:24:57] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:13] those two mw-jobrunner issues are me, looking into it [16:27:34] (03PS1) 10Andrew Bogott: Prepare cloudvirt1025-1030 for decom 
[puppet] - 10https://gerrit.wikimedia.org/r/973828 (https://phabricator.wikimedia.org/T351010) [16:27:36] (03PS1) 10Andrew Bogott: Remove mentions of decom'd cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/973829 (https://phabricator.wikimedia.org/T351010) [16:27:42] (03CR) 10JMeybohm: [C: 04-1] "helmfile.d/services/api-gateway/values-staging.yaml is missing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan) [16:28:06] 10SRE, 10Infrastructure-Foundations, 10conftool: confd fails to start after a reimage - https://phabricator.wikimedia.org/T244477 (10joanna_borun) [16:29:29] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973830 (https://phabricator.wikimedia.org/T128546) [16:30:05] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T1630). 
[16:30:47] (03PS1) 10Giuseppe Lavagetto: mobileapps: fix routing for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/973831 [16:31:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:31:20] ACKNOWLEDGEMENT - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service Hnowlan Will be fixed when service moves to production in LVS https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:20] ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly Hnowlan Will be fixed when service moves to production in LVS https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:31:20] ACKNOWLEDGEMENT - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service Hnowlan Will be fixed when service moves to production in LVS https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:20] ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly Hnowlan Will be fixed when service moves to production in LVS https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:31:36] (03PS1) 10BCornwall: pybal-test: Don't remove Python 2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/973832 (https://phabricator.wikimedia.org/T351084) [16:32:08] (03PS2) 10JMeybohm: api-gateway, rest-gateway: drop envoy-future, use latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130) 
(owner: 10Hnowlan) [16:32:10] (03PS6) 10JMeybohm: Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) [16:32:12] (03PS7) 10JMeybohm: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) [16:32:34] (03CR) 10Ssingh: [C: 03+1] pybal-test: Don't remove Python 2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/973832 (https://phabricator.wikimedia.org/T351084) (owner: 10BCornwall) [16:32:50] (03PS2) 10Giuseppe Lavagetto: mobileapps: fix routing for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/973831 [16:33:25] (03CR) 10Muehlenhoff: [C: 03+1] pybal-test: Don't remove Python 2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/973832 (https://phabricator.wikimedia.org/T351084) (owner: 10BCornwall) [16:35:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: fix routing for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/973831 (owner: 10Giuseppe Lavagetto) [16:35:42] (03CR) 10BCornwall: [C: 03+2] pybal-test: Don't remove Python 2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/973832 (https://phabricator.wikimedia.org/T351084) (owner: 10BCornwall) [16:35:48] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973830 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:36:22] (03Merged) 10jenkins-bot: mobileapps: fix routing for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/973831 (owner: 10Giuseppe Lavagetto) [16:36:31] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973830 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:37:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T348183)', diff saved to 
https://phabricator.wikimedia.org/P53346 and previous config saved to /var/cache/conftool/dbconfig/20231113-163709-arnaudb.json [16:37:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:37:15] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [16:37:25] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:37:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T348183)', diff saved to https://phabricator.wikimedia.org/P53347 and previous config saved to /var/cache/conftool/dbconfig/20231113-163730-arnaudb.json [16:38:20] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] [apt_repo] Ensure that parent directories of basedir exist [puppet] - 10https://gerrit.wikimedia.org/r/973778 (owner: 10EoghanGaffney) [16:38:35] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] [apt_repo] Ensure that parent directories of basedir exist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973778 (owner: 10EoghanGaffney) [16:39:09] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1004.eqiad.wmnet [16:39:59] (03CR) 10Dzahn: [C: 03+2] "ACK, thank you, IF" [puppet] - 10https://gerrit.wikimedia.org/r/972909 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [16:40:28] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1031.eqiad.wmnet with OS bookworm [16:40:41] RECOVERY - Check systemd state on ms-be1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:38] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: delete hiera hosts file for deleted hosts [puppet] - 10https://gerrit.wikimedia.org/r/973211 (owner: 10Dzahn) [16:41:53] !log arnaudb@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2123 (T348183)', diff saved to https://phabricator.wikimedia.org/P53348 and previous config saved to /var/cache/conftool/dbconfig/20231113-164152-arnaudb.json [16:42:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:28] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:43:13] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:43:41] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:43:59] (PuppetFailure) resolved: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:43:59] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:44:37] PROBLEM - Check systemd state on ganeti2012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:25] (03PS2) 10Jcrespo: RemoteExecution: Add comments and a fix a few lint errors [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) [16:46:06] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:973830| Bumping portals to master (T128546)]] (duration: 06m 14s) [16:46:10] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:49:32] (03PS1) 10Ottomata: envoy.yaml - Add retries for mw-api-int-async-ro [puppet] - 
10https://gerrit.wikimedia.org/r/973835 (https://phabricator.wikimedia.org/T326002) [16:51:04] (03PS1) 10Ottomata: eventgate-* - use mw-api-int-async-ro for EventStreamConfig [deployment-charts] - 10https://gerrit.wikimedia.org/r/973836 (https://phabricator.wikimedia.org/T326002) [16:51:49] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:973830| Bumping portals to master (T128546)]] (duration: 05m 42s) [16:51:53] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:52:11] (03CR) 10Jcrespo: "I've applied all but one, as that may be a bug on implementation, not on the comment itself. I am not sure yet." [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [16:52:50] (03PS1) 10Ottomata: refine - bump to refinery version 0.2.25 to pick up JsonSchemaConverter change [puppet] - 10https://gerrit.wikimedia.org/r/973837 (https://phabricator.wikimedia.org/T321854) [16:53:17] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy.yaml - Add retries for mw-api-int-async-ro [puppet] - 10https://gerrit.wikimedia.org/r/973835 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [16:54:21] (03CR) 10Ottomata: [C: 03+2] envoy.yaml - Add retries for mw-api-int-async-ro [puppet] - 10https://gerrit.wikimedia.org/r/973835 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [16:55:05] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1060 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:56:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P53349 and previous config saved to /var/cache/conftool/dbconfig/20231113-165659-arnaudb.json [16:57:19] (03PS1) 10Jbond: idp: move CA to /etc/ssl/certs/wmf-ca-certificates.crt 
[puppet] - 10https://gerrit.wikimedia.org/r/973839 (https://phabricator.wikimedia.org/T340741) [16:57:21] (03PS1) 10Jbond: openstack: update to use multiroot CA [puppet] - 10https://gerrit.wikimedia.org/r/973840 (https://phabricator.wikimedia.org/T340741) [16:57:23] (03PS1) 10Jbond: toolforge: update to use trsuted ca path [puppet] - 10https://gerrit.wikimedia.org/r/973841 (https://phabricator.wikimedia.org/T340741) [16:57:25] (03PS1) 10Jbond: wmcs::kubeadm: migrate to trusted ca path [puppet] - 10https://gerrit.wikimedia.org/r/973842 (https://phabricator.wikimedia.org/T340741) [16:57:27] (03PS1) 10Jbond: webperf::site: update to use multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) [16:57:58] !log otto@deploy2002 Started deploy [analytics/refinery@25ef91f]: deploying refinery with refinery-source 0.2.25 jars for T321854 [analytics/refinery@25ef91f2] [16:58:02] T321854: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 [17:00:26] (03PS3) 10Jcrespo: RemoteExecution: Add comments and fix a few lint errors [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) [17:00:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (10Volans) a:03Volans [17:01:27] (03CR) 10Ottomata: [C: 03+2] eventgate-* - use mw-api-int-async-ro for EventStreamConfig [deployment-charts] - 10https://gerrit.wikimedia.org/r/973836 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [17:02:36] (03Merged) 10jenkins-bot: eventgate-* - use mw-api-int-async-ro for EventStreamConfig [deployment-charts] - 10https://gerrit.wikimedia.org/r/973836 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [17:03:43] (03CR) 10CI reject: [V: 04-1] 
webperf::site: update to use multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [17:04:34] !log otto@deploy2002 Finished deploy [analytics/refinery@25ef91f]: deploying refinery with refinery-source 0.2.25 jars for T321854 [analytics/refinery@25ef91f2] (duration: 06m 36s) [17:04:43] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [17:04:44] T321854: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 [17:05:13] !log deploying eventgates to pick up change to use mw-api-int-async-ro with retries - T326002 [17:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:30] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [17:05:37] T326002: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors - https://phabricator.wikimedia.org/T326002 [17:06:15] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [17:06:34] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [17:08:47] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [17:09:03] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [17:09:11] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [17:09:30] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [17:09:43] PROBLEM - ensure kvm processes are running on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:10:47] !log 
otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [17:12:03] (03PS2) 10Eevans: cassandra: password for mediawiki_services_mobileapps role [labs/private] - 10https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993) [17:12:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P53350 and previous config saved to /var/cache/conftool/dbconfig/20231113-171205-arnaudb.json [17:12:17] (03PS1) 10Ottomata: eventstreams* - use mw-api-int-async-ro for EventStreamConfig: [deployment-charts] - 10https://gerrit.wikimedia.org/r/973846 (https://phabricator.wikimedia.org/T326002) [17:12:46] (03PS1) 10FNegri: [toolsdb] Lower innodb_buffer_pool_size [puppet] - 10https://gerrit.wikimedia.org/r/973847 (https://phabricator.wikimedia.org/T349695) [17:13:02] (03PS5) 10D3r1ck01: mc: Read mcrouter servers from an environment variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) [17:13:57] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [17:14:27] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [17:14:41] (03PS6) 10D3r1ck01: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) [17:15:22] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [17:16:49] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [17:17:03] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [17:17:13] (03CR) 10Jbond: [C: 03+2] idp: move CA to /etc/ssl/certs/wmf-ca-certificates.crt [puppet] - 10https://gerrit.wikimedia.org/r/973839 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) 
[17:17:38] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [17:18:37] (03PS2) 10Jbond: webperf::site: update to use multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) [17:20:53] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [17:21:27] (03CR) 10CI reject: [V: 04-1] webperf::site: update to use multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [17:21:33] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [17:21:38] (03CR) 10Ottomata: [C: 03+2] eventstreams* - use mw-api-int-async-ro for EventStreamConfig: [deployment-charts] - 10https://gerrit.wikimedia.org/r/973846 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [17:21:47] RECOVERY - ensure kvm processes are running on cloudvirt1031 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:21:55] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [17:22:41] (03Merged) 10jenkins-bot: eventstreams* - use mw-api-int-async-ro for EventStreamConfig: [deployment-charts] - 10https://gerrit.wikimedia.org/r/973846 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [17:26:26] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [17:26:43] (03PS3) 10Jbond: webperf::site: update to use multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) [17:27:11] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [17:27:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T348183)', diff saved to https://phabricator.wikimedia.org/P53351 and 
previous config saved to /var/cache/conftool/dbconfig/20231113-172712-arnaudb.json [17:27:15] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:27:18] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:27:28] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:27:29] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:27:43] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:27:48] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [17:27:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T348183)', diff saved to https://phabricator.wikimedia.org/P53352 and previous config saved to /var/cache/conftool/dbconfig/20231113-172748-arnaudb.json [17:28:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 5.909937787013271s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:28:20] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [17:29:29] (03CR) 10Jbond: "This box is listed as being owned by Observability but please let me know if there are better reviewers to add cheers" [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [17:29:41] (03PS4) 10Jbond: webperf::site: update to use 
multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) [17:31:13] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [17:31:28] (03PS1) 10Brion VIBBER: Don't change transcode rows during read operations [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973790 (https://phabricator.wikimedia.org/T152851) [17:31:29] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [17:31:59] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [17:32:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T348183)', diff saved to https://phabricator.wikimedia.org/P53353 and previous config saved to /var/cache/conftool/dbconfig/20231113-173209-arnaudb.json [17:32:29] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [17:32:37] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [17:33:15] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [17:33:17] (03CR) 10Ottomata: [C: 03+2] refine - bump to refinery version 0.2.25 to pick up JsonSchemaConverter change [puppet] - 10https://gerrit.wikimedia.org/r/973837 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata) [17:33:31] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [17:34:09] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [17:34:21] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [17:34:36] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [17:34:56] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [17:35:24] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda 
LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Urbanecm_WMF) Hello, would someone mind clarifying what is this stalled on please? [17:35:41] (03CR) 10Slyngshede: Ensure that build directories are cleaned up (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [17:35:50] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [17:36:22] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [17:37:04] (03Abandoned) 10Hnowlan: editor-analytics: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/972864 (owner: 10Hnowlan) [17:37:56] (03PS1) 10Brion VIBBER: Fixes to requeueTranscodes to make it easier to batch-fill [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973791 (https://phabricator.wikimedia.org/T68722) [17:38:41] (03CR) 10Hnowlan: [C: 03+1] cassandra: password for mediawiki_services_mobileapps role [labs/private] - 10https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993) (owner: 10Eevans) [17:39:28] (03PS3) 10Hnowlan: rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823 [17:40:44] (03CR) 10Raymond Ndibe: prometheus: add build and envvars api metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967963 (https://phabricator.wikimedia.org/T337390) (owner: 10Raymond Ndibe) [17:42:11] RECOVERY - Check systemd state on ganeti2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:54] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Krinkle) [17:47:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to 
https://phabricator.wikimedia.org/P53354 and previous config saved to /var/cache/conftool/dbconfig/20231113-174716-arnaudb.json [17:50:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [17:50:32] (03PS1) 10Ebernhardson: Source mjolnir deploy repo from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/973849 (https://phabricator.wikimedia.org/T346373) [17:52:48] (03CR) 10Bking: [C: 03+2] staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973242 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:53:40] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10Dwisehaupt) Thanks both for your feedback. I've chatted with Jeff and we think we should go ahead and start out with the private IP behind the C... [17:54:29] (03CR) 10Bking: [C: 03+1] Source mjolnir deploy repo from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/973849 (https://phabricator.wikimedia.org/T346373) (owner: 10Ebernhardson) [17:55:13] (03Merged) 10jenkins-bot: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973242 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:59:01] !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:59:09] !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:59:16] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[17:59:53] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-host for host people2003.codfw.wmnet [18:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T1800) [18:00:07] ryankemper: (Dis)respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T1800). Please do the needful. [18:01:31] (PS1) Dzahn: people2003: migrate to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973852 [18:02:00] (CR) Dzahn: [C: +2] people2003: migrate to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973852 (owner: Dzahn) [18:02:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P53355 and previous config saved to /var/cache/conftool/dbconfig/20231113-180222-arnaudb.json [18:03:01] (PS2) Majavah: prometheus: add build and envvars api metrics [puppet] - https://gerrit.wikimedia.org/r/967963 (https://phabricator.wikimedia.org/T337390) (owner: Raymond Ndibe) [18:05:24] (CR) Majavah: [C: +2] prometheus: add build and envvars api metrics [puppet] - https://gerrit.wikimedia.org/r/967963 (https://phabricator.wikimedia.org/T337390) (owner: Raymond Ndibe) [18:06:58] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [18:07:37] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:07:44] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:07:52] SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (Dzahn) >>! In T349402#9321478, @Dwisehaupt wrote: > Finally, in regards to a public IP or private/CDN, I'm not 100% certain. To my knowledge, th... [18:07:54] !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[18:08:00] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:09:12] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:09:19] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:16:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host people2003.codfw.wmnet [18:17:04] (CR) Kosta Harlan: [C: +1] Use local DB when on betawikis for 'virtual-mediamoderation' domain [mediawiki-config] - https://gerrit.wikimedia.org/r/973816 (https://phabricator.wikimedia.org/T351096) (owner: Dreamy Jazz) [18:17:07] SRE, Acme-chief, Traffic: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (BCornwall) In progress→Resolved [18:17:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T348183)', diff saved to https://phabricator.wikimedia.org/P53356 and previous config saved to /var/cache/conftool/dbconfig/20231113-181729-arnaudb.json [18:17:31] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [18:17:33] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:17:45] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [18:17:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53357 and previous config saved to /var/cache/conftool/dbconfig/20231113-181751-arnaudb.json [18:18:34] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (Dzahn) @Urbanecm As the new group approver for this new group would you
approve that guy @Urbanecm? [18:20:04] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-role for role: microsites::peopleweb [18:20:57] (PS1) Dzahn: peopleweb: migrate role to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973855 [18:22:01] (CR) Dzahn: [C: +1] "@jbond fyi, no issues on the inactive host" [puppet] - https://gerrit.wikimedia.org/r/973855 (owner: Dzahn) [18:22:25] (CR) Dzahn: [C: +2] peopleweb: migrate role to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973855 (owner: Dzahn) [18:23:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53358 and previous config saved to /var/cache/conftool/dbconfig/20231113-182308-arnaudb.json [18:23:13] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:23:26] (CR) Dzahn: [C: +2] "well, funny enough, on the NEXT puppet run I do see an issue now.." [puppet] - https://gerrit.wikimedia.org/r/973855 (owner: Dzahn) [18:24:07] (CR) Dzahn: [C: +2] "Error while evaluating a Function Call, No such file or directory - /srv/puppet_code/environments/production/modules/wmflib/lib/puppet/fun" [puppet] - https://gerrit.wikimedia.org/r/973855 (owner: Dzahn) [18:28:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: microsites::peopleweb [18:38:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P53359 and previous config saved to /var/cache/conftool/dbconfig/20231113-183814-arnaudb.json [18:42:47] (CR) Dzahn: [C: +2] "on another puppet run the error is gone ... odd ?"
[puppet] - https://gerrit.wikimedia.org/r/973855 (owner: Dzahn) [18:42:55] !log pool cp4052 as first cp host for bookworm testing: T342154 [18:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:59] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [18:47:14] (CR) Vgutierrez: [C: +1] mtail: Record bad requests for ATS SLI metrics [puppet] - https://gerrit.wikimedia.org/r/966930 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall) [18:47:20] (CR) Vgutierrez: [C: +1] mtail: Record bad requests for HAProxy SLI metrics [puppet] - https://gerrit.wikimedia.org/r/966918 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall) [18:49:43] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:49:47] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:49:55] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:49:59] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:50:07] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:50:13] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:53:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P53360 and previous config saved to /var/cache/conftool/dbconfig/20231113-185321-arnaudb.json [18:53:54] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:55:42] (PS2) DDesouza: Undeploy pilot survey on metawiki [mediawiki-config] -
https://gerrit.wikimedia.org/r/973808 (https://phabricator.wikimedia.org/T349854) [19:00:07] (PS1) Brion VIBBER: Only include completed transcodes in .m3u8 playlist [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/973793 (https://phabricator.wikimedia.org/T350996) [19:00:40] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1032.eqiad.wmnet with OS bookworm [19:08:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53361 and previous config saved to /var/cache/conftool/dbconfig/20231113-190827-arnaudb.json [19:08:30] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [19:08:32] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:08:43] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [19:08:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T348183)', diff saved to https://phabricator.wikimedia.org/P53362 and previous config saved to /var/cache/conftool/dbconfig/20231113-190849-arnaudb.json [19:10:28] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (Dzahn) This task has been stalled since August. As far as I can tell we are still waiting for a new SSH key. Any updates on that?
[19:11:33] SRE, SRE-Access-Requests: Deployment access for Search Platform SWE on Flink WDQS and Search pipelines - https://phabricator.wikimedia.org/T347560 (Dzahn) a:Gehel [19:12:46] SRE, SRE-Access-Requests: Deployment access for Search Platform SWE on Flink WDQS and Search pipelines - https://phabricator.wikimedia.org/T347560 (EBernhardson) Stalled→Declined The basic rights all seem to be available, if anything comes up we can revisit. [19:13:18] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (Dzahn) a:DMburugu [19:13:39] SRE, SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (thcipriani) >>! In T350779#9326278, @MatthewVernon wrote: > @thcipriani you're listed as the approver for the `r... [19:13:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T348183)', diff saved to https://phabricator.wikimedia.org/P53363 and previous config saved to /var/cache/conftool/dbconfig/20231113-191354-arnaudb.json [19:13:55] SRE, SRE-Access-Requests: Requesting access to WMF for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (Dzahn) a:Jdforrester-WMF [19:14:00] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:14:04] SRE, SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (Dzahn) a:thcipriani [19:15:58] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1032.eqiad.wmnet with OS bookworm [19:17:16] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-host for host
stewards1001.eqiad.wmnet [19:18:29] (PS1) Dzahn: stewards: migrate stewards1001 to puppet7 [puppet] - https://gerrit.wikimedia.org/r/973862 [19:18:59] (CR) Dzahn: [C: +2] stewards: migrate stewards1001 to puppet7 [puppet] - https://gerrit.wikimedia.org/r/973862 (owner: Dzahn) [19:19:54] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm [19:20:25] SRE, SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (Dzahn) a:thcipriani→None [19:20:45] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1032.eqiad.wmnet with OS bookworm [19:23:36] (PS1) Dzahn: stewards: migrate role to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973863 [19:24:29] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host stewards1001.eqiad.wmnet [19:24:43] (SystemdUnitFailed) firing: puppet-agent-timer.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:25:10] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-host for host stewards1001.eqiad.wmnet [19:26:21] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P53364 and previous config saved to /var/cache/conftool/dbconfig/20231113-192900-arnaudb.json [19:35:43] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1033.eqiad.wmnet with OS
bookworm [19:37:21] (PS2) Andrew Bogott: Prepare cloudvirt1025-1030 for decom [puppet] - https://gerrit.wikimedia.org/r/973828 (https://phabricator.wikimedia.org/T351010) [19:37:23] (PS2) Andrew Bogott: Remove mentions of decom'd cloudvirts [puppet] - https://gerrit.wikimedia.org/r/973829 (https://phabricator.wikimedia.org/T351010) [19:37:49] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1032.eqiad.wmnet with OS bookworm [19:38:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1032.eqiad.wmnet with OS bookworm [19:38:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm [19:40:57] (CR) Andrew Bogott: [C: +2] Prepare cloudvirt1025-1030 for decom [puppet] - https://gerrit.wikimedia.org/r/973828 (https://phabricator.wikimedia.org/T351010) (owner: Andrew Bogott) [19:44:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P53365 and previous config saved to /var/cache/conftool/dbconfig/20231113-194406-arnaudb.json [19:46:35] (CR) Andrew Bogott: [C: +2] mariadb - wmcs: update the ssl-ca value used by mariadb [puppet] - https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) (owner: Jbond) [19:47:59] RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host stewards1001.eqiad.wmnet with OS bookworm [19:49:00] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet wi...
[19:49:43] (SystemdUnitFailed) resolved: puppet-agent-timer.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:52:44] (PS52) Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [19:53:16] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1033.eqiad.wmnet with OS bookworm [19:53:54] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:55:19] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:55:22] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:55:32] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:55:38] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:56:30] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm [19:56:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage [19:59:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T348183)', diff saved to https://phabricator.wikimedia.org/P53366 and previous config saved to /var/cache/conftool/dbconfig/20231113-195913-arnaudb.json [19:59:15] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:59:21] T348183: Apply schema change for changing img_size, oi_size,
us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:59:28] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:59:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53367 and previous config saved to /var/cache/conftool/dbconfig/20231113-195934-arnaudb.json [19:59:47] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1032.eqiad.wmnet with OS bookworm [20:00:22] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1032.eqiad.wmnet with OS bookworm [20:00:42] o/ [20:01:07] jouncebot: now [20:01:07] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [20:01:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage [20:01:35] daylight savings time change? 
[20:03:08] probably [20:03:43] xD [20:04:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53368 and previous config saved to /var/cache/conftool/dbconfig/20231113-200451-arnaudb.json [20:05:05] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:09:21] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1033.eqiad.wmnet with OS bookworm [20:09:46] (PS1) Jdrewniak: Deploy Vector 2022 Zebra refactor to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/973865 (https://phabricator.wikimedia.org/T347711) [20:11:34] SRE, AQS2.0, Cassandra, serviceops, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (Htriedman) Any updates on this? [20:12:28] (PS53) Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [20:14:28] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage [20:15:05] !log start reindex of enwiki indexes in cloudelastic search cluster from mwmaint2002 [20:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:35] (CR) Eevans: [V: +2 C: +2] cassandra: password for mediawiki_services_mobileapps role [labs/private] - https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993) (owner: Eevans) [20:17:27] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage [20:17:55] (CR) BCornwall: [C: +2] mtail: Record bad requests for ATS SLI metrics [puppet] -
https://gerrit.wikimedia.org/r/966930 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall) [20:18:07] (CR) BCornwall: [C: +2] mtail: Record bad requests for HAProxy SLI metrics [puppet] - https://gerrit.wikimedia.org/r/966918 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall) [20:18:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm [20:19:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P53369 and previous config saved to /var/cache/conftool/dbconfig/20231113-201957-arnaudb.json [20:27:20] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1025.eqiad.wmnet [20:27:21] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [20:27:32] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [20:32:17] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:32:59] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [20:33:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 5.04867441914672s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:34:01] (PS1) Kosta Harlan: ipoid: Disable initial-import job [deployment-charts] - https://gerrit.wikimedia.org/r/973866 [20:34:03] (PS1) Kosta Harlan: ipoid: Enable and reschedule the daily updates job [deployment-charts] - https://gerrit.wikimedia.org/r/973867 [20:34:25] (CR) Andrea Denisse: [C: +1] "LGTM,thank you!"
[puppet] - https://gerrit.wikimedia.org/r/973741 (https://phabricator.wikimedia.org/T331512) (owner: Filippo Giunchedi) [20:34:34] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1025.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:35:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P53370 and previous config saved to /var/cache/conftool/dbconfig/20231113-203504-arnaudb.json [20:35:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1025.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:35:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:35:48] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1025.eqiad.wmnet [20:36:00] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1026.eqiad.wmnet [20:36:17] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [20:36:45] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [20:37:22] (PS1) Ottomata: Revert "refine - bump to refinery version 0.2.25 to pick up JsonSchemaConverter change" [puppet] - https://gerrit.wikimedia.org/r/973796 [20:37:55] (PS1) Subramanya Sastry: Parsoid-VE-MCR hack: Always return main slot output if useParsoid is set [core] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/973797 (https://phabricator.wikimedia.org/T351026) [20:38:50] (PS2) Ottomata: Revert "refine - bump to refinery version 0.2.25 to pick up JsonSchemaConverter change" [puppet] -
https://gerrit.wikimedia.org/r/973796 [20:39:24] (CR) Tchanders: [C: +1] ipoid: Enable and reschedule the daily updates job [deployment-charts] - https://gerrit.wikimedia.org/r/973867 (owner: Kosta Harlan) [20:39:58] (CR) Jdrewniak: [C: +2] Deploy Vector 2022 Zebra refactor to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/973865 (https://phabricator.wikimedia.org/T347711) (owner: Jdrewniak) [20:40:38] (CR) Ottomata: [V: +2 C: +2] Revert "refine - bump to refinery version 0.2.25 to pick up JsonSchemaConverter change" [puppet] - https://gerrit.wikimedia.org/r/973796 (owner: Ottomata) [20:40:41] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:40:41] (Merged) jenkins-bot: Deploy Vector 2022 Zebra refactor to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/973865 (https://phabricator.wikimedia.org/T347711) (owner: Jdrewniak) [20:41:10] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1032.eqiad.wmnet with OS bookworm [20:41:38] (CR) Tchanders: [C: +1] ipoid: Disable initial-import job (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/973866 (owner: Kosta Harlan) [20:42:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:50] (PS2) Kosta Harlan: ipoid: Disable initial-import job [deployment-charts] - https://gerrit.wikimedia.org/r/973866 [20:42:54] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:42:54] (CR) C.
Scott Ananian: [C: +1] Parsoid-VE-MCR hack: Always return main slot output if useParsoid is set [core] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/973797 (https://phabricator.wikimedia.org/T351026) (owner: Subramanya Sastry) [20:42:57] (CR) Kosta Harlan: ipoid: Disable initial-import job (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/973866 (owner: Kosta Harlan) [20:43:01] (PS2) Kosta Harlan: ipoid: Enable and reschedule the daily updates job [deployment-charts] - https://gerrit.wikimedia.org/r/973867 [20:43:08] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1034.eqiad.wmnet with OS bookworm [20:43:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:43:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:43:59] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1026.eqiad.wmnet [20:44:12] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1027.eqiad.wmnet [20:46:17] (PS1) DLynch: Enable edit check on swwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/973870 [20:46:43] (PS1) BCornwall: slo_definitions: Switch to using haproxy_sli_bad [grafana-grizzly] - https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) [20:47:18] !log mwmaint2002: `mwscript extensions/GrowthExperiments/maintenance/reassignMentees.php --wiki=arwiki --all --performer='Martin Urbanec (WMF)'` (T330071) [20:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:29] T330071: Mentorship: ensure that all mentees are assigned to an active mentor - https://phabricator.wikimedia.org/T330071 [20:49:07] !log andrew@cumin1001 START -
Cookbook sre.dns.netbox [20:50:07] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T348183)', diff saved to https://phabricator.wikimedia.org/P53371 and previous config saved to /var/cache/conftool/dbconfig/20231113-205010-arnaudb.json [20:50:13] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [20:50:15] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:50:26] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [20:50:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T348183)', diff saved to https://phabricator.wikimedia.org/P53372 and previous config saved to /var/cache/conftool/dbconfig/20231113-205032-arnaudb.json [20:51:10] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1027.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:51:15] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs1022.eqiad.wmnet with reason: T347504 [20:51:19] T347504: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 [20:51:29] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs1022.eqiad.wmnet with reason: T347504 [20:52:02] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T347504 [20:52:05] !log bking@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T347504 [20:52:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1027.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:52:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:52:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1027.eqiad.wmnet [20:52:30] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs1023.eqiad.wmnet with reason: T347504 [20:52:54] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs1023.eqiad.wmnet with reason: T347504 [20:53:10] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1028.eqiad.wmnet [20:53:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 5.864128550585154s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:55:27] (PS1) Btullis: Repool clouddb10[17-20] post maintenance [puppet] - https://gerrit.wikimedia.org/r/973799 (https://phabricator.wikimedia.org/T344590) [20:55:43] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/973799 (https://phabricator.wikimedia.org/T344590) (owner: Btullis) [20:55:55] (CR) CI reject: [V: -1] Repool clouddb10[17-20] post maintenance [puppet] - https://gerrit.wikimedia.org/r/973799 (https://phabricator.wikimedia.org/T344590) (owner: Btullis) [20:56:30] (PS2) Btullis: Repool
clouddb10[17-20] post maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973799 (https://phabricator.wikimedia.org/T344590) [20:58:31] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:58:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T348183)', diff saved to https://phabricator.wikimedia.org/P53373 and previous config saved to /var/cache/conftool/dbconfig/20231113-205852-arnaudb.json [20:58:56] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:59:02] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage [20:59:15] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1033.eqiad.wmnet with OS bookworm [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T2100). [21:00:04] tgr, kemayo, bvibber, danisztls, subbu, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:13] woohoo [21:00:20] hi [21:00:26] a lot of patches! [21:00:31] i can deploy :) [21:00:34] Yo [21:00:35] mine are at least all on the same extension ;) [21:00:37] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/973816 is beta config only; any objections if I +2 it, and get out of your way? 
[21:00:39] here [21:00:55] (03CR) 10Urbanecm: [C: 03+2] Don't change transcode rows during read operations [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973790 (https://phabricator.wikimedia.org/T152851) (owner: 10Brion VIBBER) [21:00:56] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1028.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [21:01:03] (03CR) 10Urbanecm: [C: 03+2] Fixes to requeueTranscodes to make it easier to batch-fill [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973791 (https://phabricator.wikimedia.org/T68722) (owner: 10Brion VIBBER) [21:01:06] o/ [21:01:11] (03CR) 10Urbanecm: [C: 03+2] Only include completed transcodes in .m3u8 playlist [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973793 (https://phabricator.wikimedia.org/T350996) (owner: 10Brion VIBBER) [21:01:14] (03CR) 10Urbanecm: [C: 03+2] Use local DB when on betawikis for 'virtual-mediamoderation' domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973816 (https://phabricator.wikimedia.org/T351096) (owner: 10Dreamy Jazz) [21:01:25] kostajh: was about to do that :) [21:01:37] urbanecm: thanks! 
[21:01:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1028.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [21:01:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:01:59] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1028.eqiad.wmnet [21:02:08] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973799 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [21:02:11] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1029.eqiad.wmnet [21:02:25] (03Merged) 10jenkins-bot: Use local DB when on betawikis for 'virtual-mediamoderation' domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973816 (https://phabricator.wikimedia.org/T351096) (owner: 10Dreamy Jazz) [21:02:33] (03PS1) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) [21:02:39] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage [21:03:09] (03PS2) 10DLynch: Enable edit check on swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973870 (https://phabricator.wikimedia.org/T350921) [21:03:16] kostajh: done [21:03:41] (Realized I forgot to put the ticket in the commit message for mine.) 
[21:03:45] (03PS3) 10Urbanecm: Enable edit check on swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973870 (https://phabricator.wikimedia.org/T350921) (owner: 10DLynch) [21:03:48] (03CR) 10Urbanecm: [C: 03+2] Enable edit check on swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973870 (https://phabricator.wikimedia.org/T350921) (owner: 10DLynch) [21:04:06] RECOVERY - Check systemd state on apt-staging2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:04:34] (03Merged) 10jenkins-bot: Enable edit check on swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973870 (https://phabricator.wikimedia.org/T350921) (owner: 10DLynch) [21:04:36] (03PS2) 10Urbanecm: Fix Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973814 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [21:04:52] (03CR) 10Urbanecm: [C: 03+2] Fix Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973814 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [21:05:13] (03PS3) 10Urbanecm: Undeploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973808 (https://phabricator.wikimedia.org/T349854) (owner: 10DDesouza) [21:05:19] (03CR) 10Urbanecm: [C: 03+2] Undeploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973808 (https://phabricator.wikimedia.org/T349854) (owner: 10DDesouza) [21:05:35] (03Merged) 10jenkins-bot: Fix Reader Demographics 2 survey [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/973814 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [21:05:54] (03CR) 10CI reject: [V: 04-1] Undeploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973808 (https://phabricator.wikimedia.org/T349854) (owner: 10DDesouza) [21:06:13] (03PS4) 10Urbanecm: Undeploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973808 (https://phabricator.wikimedia.org/T349854) (owner: 10DDesouza) [21:06:27] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:973870|Enable edit check on swwiki (T350921)]], [[gerrit:973814|Fix Reader Demographics 2 survey (T345951)]] [21:06:45] T350921: [Config] Enable Edit Check (References) at sw.wiki - https://phabricator.wikimedia.org/T350921 [21:06:45] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [21:06:48] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [21:07:46] !log urbanecm@deploy2002 dani and kemayo and urbanecm: Backport for [[gerrit:973870|Enable edit check on swwiki (T350921)]], [[gerrit:973814|Fix Reader Demographics 2 survey (T345951)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:08] danisztls: Kemayo: your patches are at mwdebug2001, can you test? 
[21:08:29] Sure thing [21:08:33] sure [21:09:06] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [21:09:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.19% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:10:29] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [21:10:29] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:10:30] urbanecm: Mine's good. 
[21:10:30] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1029.eqiad.wmnet [21:10:35] ack, ty [21:10:49] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1030.eqiad.wmnet [21:11:06] urbanecm: 973814 looks good [21:11:11] ack, ty [21:11:12] !log urbanecm@deploy2002 dani and kemayo and urbanecm: Continuing with sync [21:11:16] proceeding [21:12:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:12:40] (03CR) 10Urbanecm: [C: 03+2] Undeploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973808 (https://phabricator.wikimedia.org/T349854) (owner: 10DDesouza) [21:13:36] (03Merged) 10jenkins-bot: Undeploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973808 (https://phabricator.wikimedia.org/T349854) (owner: 10DDesouza) [21:13:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P53374 and previous config saved to /var/cache/conftool/dbconfig/20231113-211358-arnaudb.json [21:15:39] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [21:16:43] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:973870|Enable edit check on swwiki (T350921)]], [[gerrit:973814|Fix Reader Demographics 2 survey (T345951)]] (duration: 10m 15s) [21:16:55] T350921: [Config] Enable Edit Check (References) at sw.wiki - https://phabricator.wikimedia.org/T350921 [21:16:55] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [21:17:06] and deployed [21:17:09] (PHPFPMTooBusy) resolved: Not 
enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:17:22] 10SRE, 10Bitu, 10Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (10bd808) [21:17:36] (03CR) 10Urbanecm: [C: 03+2] Parsoid-VE-MCR hack: Always return main slot output if useParsoid is set [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973797 (https://phabricator.wikimedia.org/T351026) (owner: 10Subramanya Sastry) [21:17:41] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1030.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [21:18:08] (03Merged) 10jenkins-bot: Don't change transcode rows during read operations [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973790 (https://phabricator.wikimedia.org/T152851) (owner: 10Brion VIBBER) [21:18:09] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:973808|Undeploy pilot survey on metawiki (T349854)]] [21:18:21] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1035.eqiad.wmnet with OS bookworm [21:18:22] T349854: Deploy pilot survey on metawiki - https://phabricator.wikimedia.org/T349854 [21:18:29] whee [21:18:42] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1030.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [21:18:43] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:18:43] !log andrew@cumin1001 END (PASS) - 
Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1030.eqiad.wmnet [21:19:26] !log urbanecm@deploy2002 urbanecm and dani: Backport for [[gerrit:973808|Undeploy pilot survey on metawiki (T349854)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:19:48] (03Merged) 10jenkins-bot: Fixes to requeueTranscodes to make it easier to batch-fill [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973791 (https://phabricator.wikimedia.org/T68722) (owner: 10Brion VIBBER) [21:19:56] (03Merged) 10jenkins-bot: Only include completed transcodes in .m3u8 playlist [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973793 (https://phabricator.wikimedia.org/T350996) (owner: 10Brion VIBBER) [21:20:48] !log urbanecm@deploy2002 Sync cancelled. [21:21:21] * urbanecm pulls the backports too [21:21:25] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:973808|Undeploy pilot survey on metawiki (T349854)]], [[gerrit:973790|Don't change transcode rows during read operations (T152851)]], [[gerrit:973791|Fixes to requeueTranscodes to make it easier to batch-fill (T68722)]], [[gerrit:973793|Only include completed transcodes in .m3u8 playlist (T350996)]] [21:21:35] T152851: TMH should not make DB writes on HTTP GET for its on-the-fly transcode corrections - https://phabricator.wikimedia.org/T152851 [21:21:36] T68722: [iOS app] Some media (esp. 
video) files do not work - https://phabricator.wikimedia.org/T68722 [21:21:36] T350996: HLS meta playlist .m3u8 includes not-yet-made transcodes - https://phabricator.wikimedia.org/T350996 [21:22:17] (03CR) 10Bking: [C: 03+2] Source mjolnir deploy repo from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/973849 (https://phabricator.wikimedia.org/T346373) (owner: 10Ebernhardson) [21:22:21] let's gooooooooo :) [21:22:45] !log urbanecm@deploy2002 urbanecm and brion and dani: Backport for [[gerrit:973808|Undeploy pilot survey on metawiki (T349854)]], [[gerrit:973790|Don't change transcode rows during read operations (T152851)]], [[gerrit:973791|Fixes to requeueTranscodes to make it easier to batch-fill (T68722)]], [[gerrit:973793|Only include completed transcodes in .m3u8 playlist (T350996)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:23:03] bvibber: do you mind testing it? :) [21:23:15] danisztls: and you as well, the undeployment please :) [21:23:20] sure [21:24:12] urbanecm: looks good [21:24:25] ty [21:25:11] (03CR) 10Btullis: [C: 03+2] Repool clouddb10[17-20] post maintenance [puppet] - 10https://gerrit.wikimedia.org/r/973799 (https://phabricator.wikimedia.org/T344590) (owner: 10Btullis) [21:25:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1034.eqiad.wmnet with OS bookworm [21:26:01] yeah it's tough to tell on mine ;) the main change ends up in the job queue :D [21:26:06] but nothing is exploding :D [21:26:27] good enough for me! :) [21:26:36] so, ok to sync? [21:26:41] yep :D [21:26:45] !log urbanecm@deploy2002 urbanecm and brion and dani: Continuing with sync [21:26:55] just making sure. doing! 
:) [21:27:03] hehe [21:27:07] thanks ;) [21:28:02] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [21:29:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P53375 and previous config saved to /var/cache/conftool/dbconfig/20231113-212904-arnaudb.json [21:29:10] tgr|away: hi, you around for b&c? :) [21:30:07] (03CR) 10Herron: [C: 03+1] "Thanks for putting this together" [puppet] - 10https://gerrit.wikimedia.org/r/973741 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [21:30:41] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1036.eqiad.wmnet with OS bookworm [21:31:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [21:32:03] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:973808|Undeploy pilot survey on metawiki (T349854)]], [[gerrit:973790|Don't change transcode rows during read operations (T152851)]], [[gerrit:973791|Fixes to requeueTranscodes to make it easier to batch-fill (T68722)]], [[gerrit:973793|Only include completed transcodes in .m3u8 playlist (T350996)]] (duration: 10m 37s) [21:32:11] T349854: Deploy pilot survey on metawiki - https://phabricator.wikimedia.org/T349854 [21:32:12] T152851: TMH should not make DB writes on HTTP GET for its on-the-fly transcode corrections - https://phabricator.wikimedia.org/T152851 [21:32:12] T68722: [iOS app] Some media (esp. 
video) files do not work - https://phabricator.wikimedia.org/T68722 [21:32:12] T350996: HLS meta playlist .m3u8 includes not-yet-made transcodes - https://phabricator.wikimedia.org/T350996 [21:32:53] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [21:33:04] !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [21:33:56] (03PS1) 10DDesouza: Increase coverage of Reader Demographics 2 surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973876 (https://phabricator.wikimedia.org/T344393) [21:34:06] o/ [21:34:24] (03PS4) 10Urbanecm: mobile: Add MobileUrlCallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969401 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [21:34:25] sorry, had some travel-related distractions [21:34:29] (03CR) 10Urbanecm: [C: 03+2] mobile: Add MobileUrlCallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969401 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [21:34:40] hrm, not 100% sure this went right. lemme double-check [21:34:40] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1035.eqiad.wmnet with reason: host reimage [21:34:40] no worries. 
just in time :) [21:35:21] (03Merged) 10jenkins-bot: mobile: Add MobileUrlCallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969401 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [21:35:40] (03Merged) 10jenkins-bot: Parsoid-VE-MCR hack: Always return main slot output if useParsoid is set [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973797 (https://phabricator.wikimedia.org/T351026) (owner: 10Subramanya Sastry) [21:35:58] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:969401|mobile: Add MobileUrlCallback (T257852)]], [[gerrit:973797|Parsoid-VE-MCR hack: Always return main slot output if useParsoid is set (T351026 T351113)]] [21:36:06] T257852: CentralAuth edge login and autologin for some Wikimedia domains broken on mobile - https://phabricator.wikimedia.org/T257852 [21:36:06] T351026: VisualEditor adding nonsense code to file pages - https://phabricator.wikimedia.org/T351026 [21:36:06] T351113: Figure out how Parsoid will work with MCR slots to support both reading and editing clients - https://phabricator.wikimedia.org/T351113 [21:36:23] !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [21:36:38] !log btullis@deploy2002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [21:36:51] is there time for one more config patch? the backport window is loaded, but seems like it's mostly done already? [21:37:17] !log urbanecm@deploy2002 urbanecm and ssastry and tgr: Backport for [[gerrit:969401|mobile: Add MobileUrlCallback (T257852)]], [[gerrit:973797|Parsoid-VE-MCR hack: Always return main slot output if useParsoid is set (T351026 T351113)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:37:19] tgr: sure. happy to transfer to you once the last sync finishes [21:37:33] tgr: subbu: your patch's at mwdebug2001, please test :) [21:37:38] thx [21:37:43] thanks. will do. 
[21:37:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1035.eqiad.wmnet with reason: host reimage [21:40:18] !log btullis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [21:41:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (10Urbanecm) [21:41:21] so the symptom i'm seeing is that resetting video transcodes removes them but doesn't seem to start the new job, and i can't repro it on my local copy with the same version of the extension checked out [21:41:38] lemme check... [21:43:45] yeah i have no idea why it's not failing and i can't see anything obvious in logstash [21:43:49] *why it's failing [21:43:54] *sigh* [21:44:01] why it's not failing would be better i guess :) [21:44:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T348183)', diff saved to https://phabricator.wikimedia.org/P53376 and previous config saved to /var/cache/conftool/dbconfig/20231113-214411-arnaudb.json [21:44:14] urbanecm, looks good .. https://commons.wikimedia.org/w/index.php?title=File%3ALord_Bishnu-Shesh_Narayan.JPG&diff=821497259&oldid=817541086 .. (I reverted that test edit). [21:44:16] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [21:44:21] ty subbu [21:44:31] but, that uncovered a different issue which i will investigate and fix separately once this is done. [21:44:36] heh [21:44:41] :) [21:44:43] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:55] tgr: what about your patch? 
[21:45:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage [21:46:19] !log bking@deploy2002 deploy mjolnir 2.4.0 on newly-built bullseye hosts T346039 [21:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:23] T346039: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 [21:46:52] urbanecm: sorry, takes a while to check [21:47:08] ok, no worries. just checking. [21:47:52] !log bking@deploy2002 Started deploy [search/mjolnir/deploy@0f8bb60]: (no justification provided) [21:48:27] !log bking@deploy2002 Finished deploy [search/mjolnir/deploy@0f8bb60]: (no justification provided) (duration: 00m 35s) [21:48:38] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage [21:48:54] urbanecm: all good I think [21:49:01] ack, confirming [21:49:03] !log urbanecm@deploy2002 urbanecm and ssastry and tgr: Continuing with sync [21:53:14] Added one more patch to the window (not config, I mixed it up). I can deploy that one. 
[21:54:03] *ugh* i bet it's a db lag issue and it's unrelated to the patch deployment [21:54:24] this particular view is known to be fragile, if i let the thing run it runs to completion :D [21:54:32] just means i have another bug to fix later lol [21:54:33] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:969401|mobile: Add MobileUrlCallback (T257852)]], [[gerrit:973797|Parsoid-VE-MCR hack: Always return main slot output if useParsoid is set (T351026 T351113)]] (duration: 18m 34s) [21:54:40] T257852: CentralAuth edge login and autologin for some Wikimedia domains broken on mobile - https://phabricator.wikimedia.org/T257852 [21:54:40] T351026: VisualEditor adding nonsense code to file pages - https://phabricator.wikimedia.org/T351026 [21:54:40] T351113: Figure out how Parsoid will work with MCR slots to support both reading and editing clients - https://phabricator.wikimedia.org/T351113 [21:54:44] and live [21:54:48] tgr: window's yours [21:55:48] thanks! [21:55:59] bam, yeah i see what happened. we're no longer forcing a sync to the primary's position on the replica connection on the post-view screen and it ends up failing to cache correctly [21:56:10] this is a safe failure mode for now, it's just annoying and i know how to fix it :D [21:56:39] no need to roll back. [21:57:07] ack, sounds good! [21:58:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.572006493601591s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:00:04] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231113T2200) [22:01:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [22:05:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1035.eqiad.wmnet with OS bookworm [22:10:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1037.eqiad.wmnet with OS bookworm [22:12:09] yeah we also had some extra depth in the queue which exacerbated by caching problem. *phew* mystery solved :D [22:12:15] s/by/my/ [22:12:58] !log brion halting requeueTranscode jobs to let queues even out before continuing with lighter load [22:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:22] (03PS1) 10Gergő Tisza: Remove support for HTTPS-only sessions on HTTP/HTTPS wikis [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973800 (https://phabricator.wikimedia.org/T348852) [22:13:37] (03CR) 10Gergő Tisza: [C: 03+2] Remove support for HTTPS-only sessions on HTTP/HTTPS wikis [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973800 (https://phabricator.wikimedia.org/T348852) (owner: 10Gergő Tisza) [22:17:35] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1036.eqiad.wmnet with OS bookworm [22:18:32] (03Merged) 10jenkins-bot: Remove support for HTTPS-only sessions on HTTP/HTTPS wikis [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973800 (https://phabricator.wikimedia.org/T348852) (owner: 10Gergő Tisza) [22:22:57] !log tgr@deploy2002 Started scap: Backport for [[gerrit:973800|Remove support for HTTPS-only sessions on HTTP/HTTPS wikis (T348852)]] [22:23:01] T348852: Remove CentralAuth support for mixed-protocol 
HTTP/HTTPS wikis - https://phabricator.wikimedia.org/T348852 [22:23:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1038.eqiad.wmnet with OS bookworm [22:24:18] !log tgr@deploy2002 tgr: Backport for [[gerrit:973800|Remove support for HTTPS-only sessions on HTTP/HTTPS wikis (T348852)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:24:32] 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10mpopov) [22:26:14] 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10mpopov) I approve adding @Hghani (`hghani`) and @OSefu-WMF (`osefu`) to `analytics-product-users` group. [22:27:22] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage [22:32:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage [22:35:52] !log tgr@deploy2002 tgr: Continuing with sync [22:39:51] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage [22:40:13] (03PS1) 10Gergő Tisza: session: Remove incorrect warning [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973801 (https://phabricator.wikimedia.org/T348852) [22:41:15] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:973800|Remove support for HTTPS-only sessions on HTTP/HTTPS wikis (T348852)]] (duration: 18m 17s) [22:41:19] T348852: Remove CentralAuth support for mixed-protocol HTTP/HTTPS wikis - https://phabricator.wikimedia.org/T348852 [22:41:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973801 (https://phabricator.wikimedia.org/T348852) (owner: 10Gergő Tisza) [22:42:03] 
(03PS1) 10Bking: search-loader: remove references to search-loader[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/973880 (https://phabricator.wikimedia.org/T351123) [22:42:54] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage [22:45:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (10Urbanecm) >>! In T350834#9327613, @Dzahn wrote: > @Urbanecm As the new group approver for this new group would you approve that guy @Urbane... [22:46:51] (03Merged) 10jenkins-bot: session: Remove incorrect warning [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/973801 (https://phabricator.wikimedia.org/T348852) (owner: 10Gergő Tisza) [22:47:07] !log tgr@deploy2002 Started scap: Backport for [[gerrit:973801|session: Remove incorrect warning (T348852)]] [22:47:12] T348852: Remove CentralAuth support for mixed-protocol HTTP/HTTPS wikis - https://phabricator.wikimedia.org/T348852 [22:48:25] !log tgr@deploy2002 tgr: Backport for [[gerrit:973801|session: Remove incorrect warning (T348852)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:49:52] !log tgr@deploy2002 tgr: Continuing with sync [22:49:54] !log root@cumin2002 START - Cookbook sre.hosts.decommission for hosts search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet [22:52:14] !log root@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet [22:53:54] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:55:11] !log tgr@deploy2002 Finished scap: 
Backport for [[gerrit:973801|session: Remove incorrect warning (T348852)]] (duration: 08m 03s) [22:55:24] (03CR) 10Ebernhardson: [C: 03+1] search-loader: remove references to search-loader[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/973880 (https://phabricator.wikimedia.org/T351123) (owner: 10Bking) [22:55:24] T348852: Remove CentralAuth support for mixed-protocol HTTP/HTTPS wikis - https://phabricator.wikimedia.org/T348852 [22:59:38] (03CR) 10Urbanecm: [C: 04-1] "elena wants 1 day instead" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [22:59:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1037.eqiad.wmnet with OS bookworm [23:00:43] (03PS2) 10Urbanecm: IP Masking: Set expiryAfterDays to 10 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) [23:01:25] (03PS3) 10Urbanecm: IP Masking: Set expiryAfterDays to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) [23:03:25] (03PS1) 10Urbanecm: IP Masking: Set expiryAfterDays to a year [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973881 (https://phabricator.wikimedia.org/T344695) [23:03:34] (03CR) 10Urbanecm: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973881 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [23:10:08] !log UTC late deploys done [23:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1038.eqiad.wmnet with OS bookworm [23:12:29] !log wmf-reimage for stewards1001 failed with [self-signed certificate in certificate chain [23:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:41] !log dzahn@cumin1001 END (ERROR) - Cookbook 
sre.hosts.reimage (exit_code=97) for host stewards1001.eqiad.wmnet with OS bookworm [23:13:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host stewards1001.eqiad.wmnet with OS bookworm [23:13:55] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with O... [23:14:09] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet wi... [23:15:03] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 54541 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [23:18:38] Hi, can someone please check Logstash for Serbian Wikipedia? [23:18:55] I'm receiving error from ContentTranslation that translator failed to load because of internal error. [23:19:11] mw.cx.init.Translation.js:234 [CX] Translation initialization failed. [23:19:16] That's from console. [23:19:44] (PS2) RLazarus: Add golang instructions to README [docker-images/production-images] - https://gerrit.wikimedia.org/r/973280 [23:20:16] Kizule: hi, that seems like a JS error, so your console should have the whole error already?
[23:20:17] (CR) RLazarus: [V: +2 C: +2] Add golang instructions to README (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/973280 (owner: RLazarus) [23:21:14] urbanecm: I would love to fill a Phabricator task, but since I don't have access to Logstash, I'd like to see the log what's exactly saying, so I don't end up creating duplicate task. [23:21:22] Hi to you as well. :) [23:21:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage [23:21:51] Error from console is just that one line which I provided already. [23:23:07] Search for Kapuljača in Logstash of Content Translation extension in Serbian Wikipedia. [23:23:29] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:23:36] By the way: I never had luck with translating that article from Croatian to Serbian. At least now it's throwing an error. Before it was just loading forever. [23:24:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage [23:25:09] Kizule: well, the most logstash has for client errors is what is already available in the console. in other words, logstash doesn't really have more information than your developer console.
[23:32:12] urbanecm: Thank you anyways, I've created https://phabricator.wikimedia.org/T351138 now [23:33:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host clouddb1021.eqiad.wmnet [23:41:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1021.eqiad.wmnet [23:44:05] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/973741 (https://phabricator.wikimedia.org/T331512) (owner: Filippo Giunchedi) [23:53:54] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:57:23] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host stewards1001.eqiad.wmnet with OS bookworm [23:57:36] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with O... [23:57:45] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host stewards1001.eqiad.wmnet