[00:01:43] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019376 (owner: TrainBranchBot)
[00:03:20] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[00:08:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[00:13:25] (SystemdUnitFailed) firing: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:54] (CR) Sohom Datta: [C:+1] Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) (owner: Dreamrimmer)
[00:22:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[00:22:39] (CR) CI reject: [V:-1] Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) (owner: Dreamrimmer)
[00:24:30] (PS2) Sohom Datta: Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) (owner: Dreamrimmer)
[00:25:40] (CR) Sohom Datta: "I probably will not have the time to schedule a deploy for this until Wednesday." [mediawiki-config] - https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) (owner: Dreamrimmer)
[00:47:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:12:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[01:12:45] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[01:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[01:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (16) wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[02:38:29] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:43] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1020:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[03:03:25] (SystemdUnitFailed) resolved: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:29] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:13:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 830.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:18:36] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 820.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:18:40] (KubernetesAPINotScrapable) firing: (6) k8s-mlstaging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[03:31:15] (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[03:33:45] (CR) DannyS712: [C:+1] "failure was" [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019376 (owner: TrainBranchBot)
[03:52:45] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[04:17:25] (SystemdUnitFailed) firing: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:41:15] (JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[04:47:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:04] SRE, serviceops, Data Products (Data Products Sprint 11), Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9712224 (SGupta-WMF) @WDoranWMF Yep , it makes sense . I confirmed with @mforns that API paths and...
[05:17:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[05:29:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[05:29:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:29:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:30:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T356166)', diff saved to https://phabricator.wikimedia.org/P60482 and previous config saved to /var/cache/conftool/dbconfig/20240415-053001-marostegui.json
[05:30:08] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[05:43:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 996ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (16) wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:11:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T356166)', diff saved to https://phabricator.wikimedia.org/P60483 and previous config saved to /var/cache/conftool/dbconfig/20240415-061114-marostegui.json
[06:11:19] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:13:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 882.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:26:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P60484 and previous config saved to /var/cache/conftool/dbconfig/20240415-062621-marostegui.json
[06:41:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P60485 and previous config saved to /var/cache/conftool/dbconfig/20240415-064129-marostegui.json
[06:56:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T356166)', diff saved to https://phabricator.wikimedia.org/P60486 and previous config saved to /var/cache/conftool/dbconfig/20240415-065636-marostegui.json
[06:56:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[06:56:41] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:56:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[06:57:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T356166)', diff saved to https://phabricator.wikimedia.org/P60487 and previous config saved to /var/cache/conftool/dbconfig/20240415-065659-marostegui.json
[06:59:51] (CR) Muehlenhoff: [C:+2] Deprecate system::role for IF services (batch three) [puppet] - https://gerrit.wikimedia.org/r/1019255 (owner: Muehlenhoff)
[07:00:05] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T0700).
[07:00:05] NMW03: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:43] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1020:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:04:11] I am NMW03
[07:04:49] (CR) Muehlenhoff: [C:+2] icinga: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1019261 (owner: Muehlenhoff)
[07:11:46] !log restarting blazegraph on wdqs1020 (BlazegraphFreeAllocatorsDecreasingRapidly)
[07:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:34] (KubernetesAPINotScrapable) firing: (6) k8s-mlstaging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[07:22:43] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1020:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:48:54] !log restarting k8s-mlstaging and k8s-staging prometheus instances - T343529
[07:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:58] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529
[07:50:17] (CR) MVernon: [C:+2] comments: correct typos of "top" for "to" [puppet] - https://gerrit.wikimedia.org/r/1019295 (owner: MVernon)
[07:53:28] (KubernetesAPINotScrapable) resolved: (6) k8s-mlstaging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[07:54:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1217.eqiad.wmnet with reason: reboot multiinstance replica
[07:55:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1217.eqiad.wmnet with reason: reboot multiinstance replica
[07:57:27] jouncebot: nowandnext
[07:57:27] For the next 0 hour(s) and 2 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T0700)
[07:57:27] In 2 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1000)
[08:01:38] !log mvernon@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw
[08:01:42] !log depool wdqs in codfw T362508
[08:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:49] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508
[08:05:46] SRE-swift-storage, Commons, MediaWiki-extensions-Nuke: Double-deletion on Commons - https://phabricator.wikimedia.org/T173825#9712371 (Samwalton9-WMF)
[08:12:45] (PS1) Jcrespo: dbprov: Setup dbprov2005 [puppet] - https://gerrit.wikimedia.org/r/1019673 (https://phabricator.wikimedia.org/T362509)
[08:13:04] (PS2) Jcrespo: dbprov: Setup dbprov2005 [puppet] - https://gerrit.wikimedia.org/r/1019673 (https://phabricator.wikimedia.org/T362509)
[08:18:43] (PS1) Hashar: Merge tag 'v3.8.5' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019674 (https://phabricator.wikimedia.org/T354886)
[08:19:29] (CR) CI reject: [V:-1] Merge tag 'v3.8.5' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019674 (https://phabricator.wikimedia.org/T354886) (owner: Hashar)
[08:22:16] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019376 (owner: TrainBranchBot)
[08:24:14] (CR) JMeybohm: [V:+1 C:-2] "Thanks - but unfortunately also very insufficient" [puppet] - https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[08:26:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T356166)', diff saved to https://phabricator.wikimedia.org/P60488 and previous config saved to /var/cache/conftool/dbconfig/20240415-082623-marostegui.json
[08:26:28] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[08:30:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019377
[08:30:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019377 (owner: TrainBranchBot)
[08:33:25] jouncebot: next
[08:33:25] In 1 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1000)
[08:33:53] I'm switching logstash.w.o to SSO, cfr T246998
[08:33:55] T246998: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998
[08:34:04] (PS3) Jcrespo: dbprov: Setup dbprov2005 [puppet] - https://gerrit.wikimedia.org/r/1019673 (https://phabricator.wikimedia.org/T362509)
[08:34:14] (PS4) Jcrespo: dbprov: Setup dbprov2005 [puppet] - https://gerrit.wikimedia.org/r/1019673 (https://phabricator.wikimedia.org/T362509)
[08:35:09] (CR) Filippo Giunchedi: [V:+1 C:+2] opensearch: switch dashboards to sso auth [puppet] - https://gerrit.wikimedia.org/r/1018872 (https://phabricator.wikimedia.org/T246998) (owner: Filippo Giunchedi)
[08:35:52] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons.
[08:37:33] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019377 (owner: TrainBranchBot)
[08:38:26] (CR) Effie Mouzeli: [C:+1] restbase: migrate to using cfssl [puppet] - https://gerrit.wikimedia.org/r/1019290 (https://phabricator.wikimedia.org/T360636) (owner: Hnowlan)
[08:41:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P60490 and previous config saved to /var/cache/conftool/dbconfig/20240415-084130-marostegui.json
[08:42:24] (PS8) Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690)
[08:44:17] (CR) Jcrespo: [C:+2] dbprov: Setup dbprov2005 [puppet] - https://gerrit.wikimedia.org/r/1019673 (https://phabricator.wikimedia.org/T362509) (owner: Jcrespo)
[08:44:22] (CR) Jcrespo: [C:+2] "https://puppet-compiler.wmflabs.org/output/1019673/1900/dbprov2005.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1019673 (https://phabricator.wikimedia.org/T362509) (owner: Jcrespo)
[08:45:15] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons.
[08:45:41] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019378
[08:45:41] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019378 (owner: TrainBranchBot)
[08:46:18] !log logstash.w.o now uses sso - T246998
[08:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:23] T246998: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998
[08:47:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:09] (PS1) Jcrespo: dbbackups: Migrate s5 codfw snapshots to dbprov2005/db2201 [puppet] - https://gerrit.wikimedia.org/r/1019679 (https://phabricator.wikimedia.org/T360751)
[08:53:31] !log restart dbprov2005
[08:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:02] godog: I love you for that change!
[08:54:28] (PS6) JMeybohm: kubernetes::node: Add support for the SeccompDefault feature gate [puppet] - https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507)
[08:54:34] jynus: you are welcome! I'm pretty happy with it too
[08:54:35] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[08:56:13] (PS1) Ladsgroup: Set all wikis to read new for pagelinks migration except trwiki, zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1019680 (https://phabricator.wikimedia.org/T351237)
[08:56:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P60491 and previous config saved to /var/cache/conftool/dbconfig/20240415-085638-marostegui.json
[08:57:57] (PS2) Ladsgroup: Set all wikis to read new for pagelinks migration except trwiki, zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1019680 (https://phabricator.wikimedia.org/T351237)
[08:58:15] (CR) Hashar: "recheck" [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019674 (https://phabricator.wikimedia.org/T354886) (owner: Hashar)
[08:58:26] jouncebot: nowandnext
[08:58:26] No deployments scheduled for the next 1 hour(s) and 1 minute(s)
[08:58:26] In 1 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1000)
[08:58:30] awesome
[08:58:39] (CR) Ladsgroup: [C:+2] Set all wikis to read new for pagelinks migration except trwiki, zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1019680 (https://phabricator.wikimedia.org/T351237) (owner: Ladsgroup)
[08:59:07] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1019680 (https://phabricator.wikimedia.org/T351237) (owner: Ladsgroup)
[08:59:46] (CR) Volans: [C:+2] "Great! LGTM and I've run tox with all supported python versions also removing the curator deps and unblocking the black upper limit. Mergi" [software/spicerack] - https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: Bking)
[09:00:09] (Merged) jenkins-bot: Set all wikis to read new for pagelinks migration except trwiki, zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1019680 (https://phabricator.wikimedia.org/T351237) (owner: Ladsgroup)
[09:00:29] godog: should we worry about the lvs errors of logstash?
[09:00:31] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1019680|Set all wikis to read new for pagelinks migration except trwiki, zhwiki (T351237)]]
[09:00:43] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[09:00:54] (CR) Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/820749 (owner: Jaime Nuche)
[09:01:16] jynus: not to worry yet, I'm working on a fix tho
[09:01:24] ok, thanks
[09:01:51] I am also trying to fix the s5 backup ones, but it is taking me a few changes
[09:01:52] (CR) Jaime Nuche: [C:+1] scap: add option to selectivlely disable bootstrapping (2 comments) [puppet] - https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: Cwhite)
[09:02:13] (CR) Jcrespo: [C:+2] dbbackups: Migrate s5 codfw snapshots to dbprov2005/db2201 [puppet] - https://gerrit.wikimedia.org/r/1019679 (https://phabricator.wikimedia.org/T360751) (owner: Jcrespo)
[09:02:33] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1019680|Set all wikis to read new for pagelinks migration except trwiki, zhwiki (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:02:46] (CR) Gmodena: analytics: refinery: add webrequest_frontend timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1017041 (https://phabricator.wikimedia.org/T314956) (owner: Gmodena)
[09:03:56] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[09:07:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019378 (owner: TrainBranchBot)
[09:07:40] (Merged) jenkins-bot: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: Bking)
[09:08:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[09:08:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[09:08:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T352010)', diff saved to https://phabricator.wikimedia.org/P60492 and previous config saved to /var/cache/conftool/dbconfig/20240415-090834-ladsgroup.json
[09:08:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:09:23] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1019680|Set all wikis to read new for pagelinks migration except trwiki, zhwiki (T351237)]] (duration: 08m 51s)
[09:09:31] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[09:11:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T356166)', diff saved to https://phabricator.wikimedia.org/P60493 and previous config saved to /var/cache/conftool/dbconfig/20240415-091145-marostegui.json
[09:11:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[09:11:51] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[09:12:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[09:12:27] (CR) Hashar: [C:+2] Merge tag 'v3.8.5' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019674 (https://phabricator.wikimedia.org/T354886) (owner: Hashar)
[09:14:17] (PS1) Joal: Update yarn scheduler's queues configuration [puppet] - https://gerrit.wikimedia.org/r/1019683 (https://phabricator.wikimedia.org/T361499)
[09:14:56] !log cgoubert@deploy1002 Started scap: T351237
[09:15:02] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[09:15:13] (PS1) Filippo Giunchedi: opensearch: skip auth for healtcheck endpoints [puppet] - https://gerrit.wikimedia.org/r/1019684 (https://phabricator.wikimedia.org/T246998)
[09:17:21] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1019684 (https://phabricator.wikimedia.org/T246998) (owner: Filippo Giunchedi)
[09:17:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:27] (PS3) Ladsgroup: WMCS: Add --quiet option to maintain-views [puppet] - https://gerrit.wikimedia.org/r/1016912 (owner: Tim Starling)
[09:18:31] (CR) Ladsgroup: [C:+2] WMCS: Add --quiet option to maintain-views [puppet] - https://gerrit.wikimedia.org/r/1016912 (owner: Tim Starling)
[09:18:33] (CR) Ladsgroup: [V:+2 C:+2] WMCS: Add --quiet option to maintain-views [puppet] - https://gerrit.wikimedia.org/r/1016912 (owner: Tim Starling)
[09:18:55] (Merged) jenkins-bot: Merge tag 'v3.8.5' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019674 (https://phabricator.wikimedia.org/T354886) (owner: Hashar)
[09:19:06] (CR) Filippo Giunchedi: [V:+1 C:+2] opensearch: skip auth for healtcheck endpoints [puppet] - https://gerrit.wikimedia.org/r/1019684 (https://phabricator.wikimedia.org/T246998) (owner: Filippo Giunchedi)
[09:19:23] Amir1: merged your change too
[09:19:34] oh thanks!
[09:19:53] sure np
[09:21:22] (PS1) Hashar: Gerrit 3.8.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019685 (https://phabricator.wikimedia.org/T354886)
[09:21:34] (PS1) Volans: setup.py: remove dependency elasticsearch-curator [software/spicerack] - https://gerrit.wikimedia.org/r/1019686 (https://phabricator.wikimedia.org/T345337)
[09:23:58] (CR) Hashar: [C:+2] Gerrit 3.8.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019685 (https://phabricator.wikimedia.org/T354886) (owner: Hashar)
[09:24:09] (CR) FNegri: [C:+1] setup.py: remove dependency elasticsearch-curator [software/spicerack] - https://gerrit.wikimedia.org/r/1019686 (https://phabricator.wikimedia.org/T345337) (owner: Volans)
[09:25:12] (Merged) jenkins-bot: Gerrit 3.8.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019685 (https://phabricator.wikimedia.org/T354886) (owner: Hashar)
[09:26:39] !log cgoubert@deploy1002 Finished scap: T351237 (duration: 11m 43s)
[09:26:44] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[09:28:23] (PS1) Jcrespo: mariadb: Remove db2101 from service and make it a spare [puppet] - https://gerrit.wikimedia.org/r/1019689 (https://phabricator.wikimedia.org/T362311)
[09:29:35] (CR) Volans: [C:+2] setup.py: remove dependency elasticsearch-curator [software/spicerack] - https://gerrit.wikimedia.org/r/1019686 (https://phabricator.wikimedia.org/T345337) (owner: Volans)
[09:34:03] (PS7) JMeybohm: kubernetes::node: Add support for the SeccompDefault feature gate [puppet] - https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507)
[09:36:45] (Merged) jenkins-bot: setup.py: remove dependency elasticsearch-curator [software/spicerack] - https://gerrit.wikimedia.org/r/1019686 (https://phabricator.wikimedia.org/T345337) (owner: Volans)
[09:37:10] (CR) JMeybohm: [V:+1 C:-2] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1903/c" [puppet] - https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[09:39:34] godog: I get a massive 403 when I want to access to logstash 😭
[09:39:55] "Sign in" doesn't work either
[09:40:29] (PS1) Jcrespo: mariadb: Upgrade db2139 to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1019690 (https://phabricator.wikimedia.org/T360751)
[09:41:28] (CR) Jcrespo: [C:-1] "Blocked on a last custom dump of the old host (just in case) as well as a successful backup of the new host." [puppet] - https://gerrit.wikimedia.org/r/1019689 (https://phabricator.wikimedia.org/T362311) (owner: Jcrespo)
[09:41:59] (PS8) JMeybohm: kubernetes::node: Add support for the SeccompDefault feature gate [puppet] - https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507)
[09:42:27] Amir1: mmhh checking
[09:43:18] Amir1: works for me :( tried logging in from an incognito window
[09:43:39] godog: might be that my LDAP email is not WMF?
[09:44:35] Amir1: could be yeah, checking groups now [09:45:01] (03CR) 10JMeybohm: [V:03+1 C:04-2] "PCC SUCCESS (DIFF 12 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:45:42] i think it's that, it's the `email_domains` oauth2_proxy setting. that setting should be unnecessary in our use as we use `required_groups` on the IDP level [09:46:39] ah yeah totally, sending a fix [09:46:47] Amir1: does thanos.w.o let you in ? [09:47:05] godog: nope [09:47:24] ok, fixing [09:48:42] (03CR) 10JMeybohm: [V:03+1] kubernetes::node: Add support for the SeccompDefault feature gate [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:49:04] (03PS1) 10Filippo Giunchedi: oauth2_proxy: default to all email domains [puppet] - 10https://gerrit.wikimedia.org/r/1019692 (https://phabricator.wikimedia.org/T246998) [09:49:13] Amir1: ^ [09:49:50] (03CR) 10Ladsgroup: [C:03+1] oauth2_proxy: default to all email domains [puppet] - 10https://gerrit.wikimedia.org/r/1019692 (https://phabricator.wikimedia.org/T246998) (owner: 10Filippo Giunchedi) [09:50:04] thanks [09:50:05] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] oauth2_proxy: default to all email domains [puppet] - 10https://gerrit.wikimedia.org/r/1019692 (https://phabricator.wikimedia.org/T246998) (owner: 10Filippo Giunchedi) [09:50:28] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [09:50:40] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. 
[09:51:54] I am going to upgrade gerrit [09:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (16) wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:52:53] !log hashar@deploy1002 Started deploy [gerrit/gerrit@2f3d3d4]: Gerrit to 3.8.5 on gerrit2002 - T354886 [09:52:58] T354886: Upgrade to Gerrit 3.8 - https://phabricator.wikimedia.org/T354886 [09:53:01] Amir1: does it work now ? [09:53:02] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@2f3d3d4]: Gerrit to 3.8.5 on gerrit2002 - T354886 (duration: 00m 08s) [09:53:24] now works, awesome. thanks! 
[09:53:36] sure np [09:55:22] (03PS1) 10Arthur taylor: Change mul deployment on beta to limited version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019694 (https://phabricator.wikimedia.org/T356169) [09:56:56] !log hashar@deploy1002 Started deploy [gerrit/gerrit@2f3d3d4]: Gerrit to 3.8.5 on gerrit1003 - T354886 [09:57:02] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@2f3d3d4]: Gerrit to 3.8.5 on gerrit1003 - T354886 (duration: 00m 06s) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1000) [10:05:57] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1019708 [10:06:41] (03PS1) 10Hashar: Merge branch 'deploy/wmf/stable-3.7' into deploy/wmf/stable-3.8 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1019709 [10:08:46] (03CR) 10Hashar: [C:03+2] Merge branch 'deploy/wmf/stable-3.7' into deploy/wmf/stable-3.8 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1019709 (owner: 10Hashar) [10:09:24] (03Merged) 10jenkins-bot: Merge branch 'deploy/wmf/stable-3.7' into deploy/wmf/stable-3.8 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1019709 (owner: 10Hashar) [10:09:37] (03PS2) 10Jcrespo: mariadb: Remove db2101 from services [puppet] - 10https://gerrit.wikimedia.org/r/1019689 (https://phabricator.wikimedia.org/T362311) [10:10:30] !log hashar@deploy1002 Started deploy [gerrit/gerrit@47eacb9]: Update Javascript plugins for Gerrit 3.8 - T354886 [10:10:35] T354886: Upgrade to Gerrit 3.8 - https://phabricator.wikimedia.org/T354886 [10:10:37] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@47eacb9]: Update Javascript plugins for Gerrit 3.8 - T354886 (duration: 00m 07s) [10:10:51] (03PS2) 10Jcrespo: mariadb: Upgrade db2139 to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1019690 
(https://phabricator.wikimedia.org/T360751) [10:10:52] (03PS2) 10Clément Goubert: docker: Remove buster-backports from sources.list [puppet] - 10https://gerrit.wikimedia.org/r/1019706 (https://phabricator.wikimedia.org/T362518) [10:11:22] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM, can be deployed at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019694 (https://phabricator.wikimedia.org/T356169) (owner: 10Arthur taylor) [10:13:59] (03CR) 10Clément Goubert: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1907/co" [puppet] - 10https://gerrit.wikimedia.org/r/1019706 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [10:14:33] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v8.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1019708 (owner: 10Volans) [10:16:18] (03PS3) 10Clément Goubert: docker: Remove buster-backports from sources.list [puppet] - 10https://gerrit.wikimedia.org/r/1019706 (https://phabricator.wikimedia.org/T362518) [10:17:58] (03CR) 10Clément Goubert: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1908/co" [puppet] - 10https://gerrit.wikimedia.org/r/1019706 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [10:19:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1019706 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [10:19:42] (03CR) 10Clément Goubert: [V:03+1 C:03+2] docker: Remove buster-backports from sources.list [puppet] - 10https://gerrit.wikimedia.org/r/1019706 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [10:20:12] (03PS1) 10Muehlenhoff: Add a component/lilypond [puppet] - 10https://gerrit.wikimedia.org/r/1019710 (https://phabricator.wikimedia.org/T362518) [10:20:48] (03CR) 10Clément 
Goubert: [C:03+1] Add a component/lilypond [puppet] - 10https://gerrit.wikimedia.org/r/1019710 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [10:20:58] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1019708 (owner: 10Volans) [10:22:17] (03CR) 10Muehlenhoff: [C:03+2] Add a component/lilypond [puppet] - 10https://gerrit.wikimedia.org/r/1019710 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [10:22:45] !log Launching build-base-images on build2001 - T362518 [10:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:50] T362518: Deprecate buster-backports - https://phabricator.wikimedia.org/T362518 [10:24:28] (03PS1) 10Volans: Upstream release v8.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1019711 [10:27:53] (03PS1) 10Filippo Giunchedi: wmflib: add magru to sites [puppet] - 10https://gerrit.wikimedia.org/r/1019712 (https://phabricator.wikimedia.org/T346722) [10:28:01] (03CR) 10Ladsgroup: "this is good but maybe wikipedi-pl-sysop? Idea from https://meta.wikimedia.org/wiki/Mailing_lists/Standardization" [dns] - 10https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [10:30:26] (03CR) 10Volans: [C:03+1] "This patch was planned for this week in preparation for the new DC but AFAIK no prometheus instance should poll from it right now. LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019712 (https://phabricator.wikimedia.org/T346722) (owner: 10Filippo Giunchedi) [10:31:02] !log imported lilypond/lilypond-data 2.22.0-10~bpo10+1 to component/lilypond T362518 [10:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:07] T362518: Deprecate buster-backports - https://phabricator.wikimedia.org/T362518 [10:31:44] !log bounce prometheus@k8s-staging in eqiad - T343529 [10:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:48] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [10:33:19] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [10:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:28] (KubernetesAPINotScrapable) resolved: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [10:38:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:38:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:38:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T356166)', diff saved to https://phabricator.wikimedia.org/P60494 and previous config saved to /var/cache/conftool/dbconfig/20240415-103853-marostegui.json [10:38:59] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [10:42:37] (03CR) 10Slyngshede: IP blocking (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1018256 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [10:42:55] (03PS1) 10Aklapper: phabricator weekly changes email: 
List Diffusion repository renames [puppet] - 10https://gerrit.wikimedia.org/r/1019715 (https://phabricator.wikimedia.org/T197699) [10:46:01] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [10:46:02] (03CR) 10Slyngshede: "Ideally I would have liked to use SSH fingerprints, but the version of Paramiko we use has some limitation regarding specifically loading " [software/bitu] - 10https://gerrit.wikimedia.org/r/1019271 (https://phabricator.wikimedia.org/T359532) (owner: 10Slyngshede) [10:48:59] (03CR) 10Slyngshede: Update links to point to non-wiki privacy policy and bypass redirects (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129) (owner: 10Pppery) [10:51:44] (03CR) 10Majavah: [C:03+1] Upstream release v8.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1019711 (owner: 10Volans) [10:52:29] (03PS1) 10Muehlenhoff: Add a component/shellcheck [puppet] - 10https://gerrit.wikimedia.org/r/1019717 (https://phabricator.wikimedia.org/T362518) [10:54:01] (03CR) 10Volans: [C:03+2] Upstream release v8.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1019711 (owner: 10Volans) [10:56:14] (03CR) 10Muehlenhoff: [C:03+2] Add a component/shellcheck [puppet] - 10https://gerrit.wikimedia.org/r/1019717 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [11:00:30] (03Merged) 10jenkins-bot: Upstream release v8.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1019711 (owner: 10Volans) [11:03:31] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[11:06:46] (03PS1) 10Volans: constants: add the new magru datacenter [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1019720 (https://phabricator.wikimedia.org/T346722) [11:07:44] !log imported shellcheck 0.7.1-1~bpo10+1 to component/shellcheck T362518 [11:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:48] T362518: Deprecate buster-backports - https://phabricator.wikimedia.org/T362518 [11:11:39] (03PS1) 10Muehlenhoff: Remove obsolete apt::pin for buster-backports [puppet] - 10https://gerrit.wikimedia.org/r/1019721 (https://phabricator.wikimedia.org/T362518) [11:13:32] !log uploaded spicerack_8.5.0 to apt.wikimedia.org bullseye-wikimedia [11:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:54] taavi: ^^^ new spicerack release also for you :) I'll probably upgrade one cumin host after lunch and test the new things, but let me know if you need anything else. I'm notifying also the usual owners for cloudcumin [11:16:21] volans: <3 I can take care of upgrading cloudcumins [11:16:30] (03CR) 10Ayounsi: [C:03+1] constants: add the new magru datacenter [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1019720 (https://phabricator.wikimedia.org/T346722) (owner: 10Volans) [11:17:02] (03CR) 10Volans: [C:03+2] constants: add the new magru datacenter [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1019720 (https://phabricator.wikimedia.org/T346722) (owner: 10Volans) [11:17:04] (KubernetesAPINotScrapable) firing: k8s-mlstaging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:17:44] up to you if you want to wait me for testing it in prod or be brave and go ahead before that :D [11:21:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T356166)', diff saved to https://phabricator.wikimedia.org/P60495 and previous config saved to 
/var/cache/conftool/dbconfig/20240415-112102-marostegui.json [11:21:08] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [11:21:42] (KubernetesAPINotScrapable) firing: (4) k8s-mlstaging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:24:06] (03Merged) 10jenkins-bot: constants: add the new magru datacenter [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1019720 (https://phabricator.wikimedia.org/T346722) (owner: 10Volans) [11:26:42] (KubernetesAPINotScrapable) firing: (4) k8s-mlstaging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:30:18] (03PS7) 10Winston Sung: zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015551 (https://phabricator.wikimedia.org/T361427) (owner: 10S8321414) [11:31:21] (03PS1) 10Muehlenhoff: Only install Go from backports on bullseye-based stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1019726 (https://phabricator.wikimedia.org/T362518) [11:31:42] (KubernetesAPINotScrapable) resolved: (3) k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:36:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P60496 and previous config saved to /var/cache/conftool/dbconfig/20240415-113610-marostegui.json [11:36:43] (03PS1) 10Stevemunene: Upgrading datahub to v0.12.1 We are upgrading datahub to v0.12.1 in response to some vulnerabilities in versions < v0.12.0 v0.13.0 is the latest stable release but is only compatible with Java 17 thus we are using v0.12.1 the last 
stable release that supports Java 11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019729 (https://phabricator.wikimedia.org/T361688) [11:37:44] (03PS1) 10Muehlenhoff: beta::mediawiki_packages: Install lilypond from component [puppet] - 10https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) [11:38:12] (03CR) 10CI reject: [V:04-1] beta::mediawiki_packages: Install lilypond from component [puppet] - 10https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [11:38:24] (03PS2) 10Stevemunene: Upgrading datahub to v0.12.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019729 (https://phabricator.wikimedia.org/T361688) [11:40:59] (03PS2) 10Muehlenhoff: beta::mediawiki_packages: Install lilypond from component [puppet] - 10https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) [11:47:31] (03PS1) 10Majavah: hieradata: Drop defusedxml dependency from cloudcumins [puppet] - 10https://gerrit.wikimedia.org/r/1019732 (https://phabricator.wikimedia.org/T314664) [11:50:51] (03PS1) 10Jcrespo: test [puppet] - 10https://gerrit.wikimedia.org/r/1019733 [11:51:01] (03PS1) 10Muehlenhoff: Only add backports on Bullseye/Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1019734 (https://phabricator.wikimedia.org/T362518) [11:51:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P60497 and previous config saved to /var/cache/conftool/dbconfig/20240415-115118-marostegui.json [11:51:45] (03CR) 10FNegri: [C:03+1] "LGTM, thanks for spotting this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019732 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [11:51:50] (03CR) 10Majavah: [C:03+2] hieradata: Drop defusedxml dependency from cloudcumins [puppet] - 10https://gerrit.wikimedia.org/r/1019732 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [11:52:03] (03PS1) 10Jcrespo: test2 [puppet] - 10https://gerrit.wikimedia.org/r/1019735 [11:53:33] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1019734 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [11:53:35] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1019712 (https://phabricator.wikimedia.org/T346722) (owner: 10Filippo Giunchedi) [11:55:19] (03CR) 10Muehlenhoff: [C:03+2] Only add backports on Bullseye/Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1019734 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [11:56:14] (03PS2) 10Jcrespo: test [puppet] - 10https://gerrit.wikimedia.org/r/1019733 [11:57:06] (03PS2) 10TChin: Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [11:57:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 depool T360116', diff saved to https://phabricator.wikimedia.org/P60498 and previous config saved to /var/cache/conftool/dbconfig/20240415-115708-arnaudb.json [11:57:13] T360116: Upgrade s5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T360116 [11:57:22] (03CR) 10TChin: Add datasets-config helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [11:58:02] (03CR) 10CI reject: [V:04-1] Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [11:58:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
db[2128,2186].codfw.wmnet with reason: upgrade db2128 T360116 [11:58:20] (03PS3) 10Jcrespo: mariadb: Add dbprov2005 to the list of hosts that can access dbbackups db [puppet] - 10https://gerrit.wikimedia.org/r/1019733 (https://phabricator.wikimedia.org/T362509) [11:58:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2128,2186].codfw.wmnet with reason: upgrade db2128 T360116 [11:58:33] (03PS4) 10Jcrespo: mariadb: Add dbprov2005 to the list of hosts that can access dbbackups db [puppet] - 10https://gerrit.wikimedia.org/r/1019733 (https://phabricator.wikimedia.org/T362509) [11:58:53] (03PS5) 10Jcrespo: mariadb: Add dbprov2005 to the list of hosts with dbbackups db access [puppet] - 10https://gerrit.wikimedia.org/r/1019733 (https://phabricator.wikimedia.org/T362509) [12:00:11] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2128.codfw.wmnet [12:01:24] (03PS2) 10Jcrespo: dbbackups: Add dbprov2005 to the list of hosts that can backup m1 [puppet] - 10https://gerrit.wikimedia.org/r/1019735 (https://phabricator.wikimedia.org/T362509) [12:03:20] (03CR) 10Jcrespo: [C:03+2] mariadb: Add dbprov2005 to the list of hosts with dbbackups db access [puppet] - 10https://gerrit.wikimedia.org/r/1019733 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [12:04:38] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-eqiad and not P{cp1112.eqiad.wmnet,cp1113.eqiad.wmnet,cp1115.eqiad.wmnet} and A:cp [12:05:53] (03PS3) 10TChin: Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [12:06:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2128.codfw.wmnet [12:06:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T356166)', diff saved to https://phabricator.wikimedia.org/P60499 and previous config saved to 
/var/cache/conftool/dbconfig/20240415-120627-marostegui.json [12:06:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [12:06:32] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [12:06:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [12:06:45] (03CR) 10CI reject: [V:04-1] Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:06:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T356166)', diff saved to https://phabricator.wikimedia.org/P60500 and previous config saved to /var/cache/conftool/dbconfig/20240415-120650-marostegui.json [12:06:56] (03CR) 10Filippo Giunchedi: [C:03+2] wmflib: add magru to sites [puppet] - 10https://gerrit.wikimedia.org/r/1019712 (https://phabricator.wikimedia.org/T346722) (owner: 10Filippo Giunchedi) [12:09:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2128.codfw.wmnet with OS bookworm [12:10:10] !deploy new database grants for m1 <- dbbprov1005 [12:10:33] jynus: it looks like you missed !log [12:10:41] vgutierrez: thank you, indeed [12:10:50] good also given it failed [12:12:24] !log deploy new database grants for m1 <- dbbprov1005 [12:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:41] now it worked (the change) [12:21:41] (03PS1) 10Hnowlan: aptrepo: remove buster-backports haproxy components [puppet] - 10https://gerrit.wikimedia.org/r/1019748 (https://phabricator.wikimedia.org/T362518) [12:24:59] (03CR) 10Muehlenhoff: "We can rather keep them around. External users of apt.wikimedia.org might still use them (or cloud VPS etc). 
When a distro gets retired fo" [puppet] - 10https://gerrit.wikimedia.org/r/1019748 (https://phabricator.wikimedia.org/T362518) (owner: 10Hnowlan) [12:25:10] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019766 [12:26:17] (03CR) 10Hnowlan: "Ah, fair enough!" [puppet] - 10https://gerrit.wikimedia.org/r/1019748 (https://phabricator.wikimedia.org/T362518) (owner: 10Hnowlan) [12:26:26] (03Abandoned) 10Hnowlan: aptrepo: remove buster-backports haproxy components [puppet] - 10https://gerrit.wikimedia.org/r/1019748 (https://phabricator.wikimedia.org/T362518) (owner: 10Hnowlan) [12:29:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2128.codfw.wmnet with reason: host reimage [12:32:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2128.codfw.wmnet with reason: host reimage [12:34:39] !log manually apply 5m retention policy to thanos data, blocks will be deleted in 48h - T351927 [12:35:14] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-eqiad and not P{cp1112.eqiad.wmnet,cp1113.eqiad.wmnet,cp1115.eqiad.wmnet} and A:cp [12:39:23] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9713041 (10Gehel) [12:40:06] 07sre-alert-triage, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496#9713045 (10Gehel) [12:40:58] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.04.15 - 2024.05.05): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9713057 (10Gehel) [12:46:59] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), and 2 
others: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9713106 (10Gehel) [12:47:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T356166)', diff saved to https://phabricator.wikimedia.org/P60501 and previous config saved to /var/cache/conftool/dbconfig/20240415-124848-marostegui.json [12:48:53] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [12:51:40] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05): Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9713129 (10Gehel) [12:54:08] (03PS3) 10Sohom Datta: Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) (owner: 10Dreamrimmer) [12:54:20] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#9713149 (10MoritzMuehlenhoff) You don't need 4G of RAM, 2 should be perfectly fine. Also, let's not use idm2001-dev, that's too confusin with the... 
[12:54:30] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#9713152 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:55:10] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#9713154 (10SLyngshede-WMF) a:03SLyngshede-WMF [12:55:26] (03PS1) 10Fabfur: haproxy: increase accept-language max length [puppet] - 10https://gerrit.wikimedia.org/r/1019760 (https://phabricator.wikimedia.org/T360415) [12:56:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2128.codfw.wmnet with OS bookworm [12:57:42] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1007933 (https://phabricator.wikimedia.org/T357496) (owner: 10Gehel) [13:00:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60502 and previous config saved to /var/cache/conftool/dbconfig/20240415-130005-arnaudb.json [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1300) [13:00:06] Dreamy_Jazz, NMW03, codders, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:10] \o [13:00:15] i can deploy today [13:00:15] I can self deploy [13:00:24] i can't self-deploy :) [13:00:33] o/ [13:00:41] I can deploy codders’s change ^^ [13:00:46] Dreamy_Jazz: i think we can speed the window up by deploying some of the patches together [13:00:53] Sure. [13:00:56] otherwise it'd be 20-25 mins per change easily [13:01:09] My change should be a no-op [13:01:26] (03PS3) 10Jcrespo: dbbackups: Add dbprov2005 to the list of hosts that can backup m1 [puppet] - 10https://gerrit.wikimedia.org/r/1019735 (https://phabricator.wikimedia.org/T362509) [13:01:36] (03PS13) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [13:01:42] (03CR) 10Urbanecm: [C:03+2] Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) (owner: 10Dreamy Jazz) [13:01:57] "Urbanecm on behalf of Dreamy Jazz", that's a new one [13:02:03] IKR [13:02:14] o_O [13:02:27] I think it used to prevent rebases using the rebase button causing you to be the "uploader" [13:02:34] (03Merged) 10jenkins-bot: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) (owner: 10Dreamy Jazz) [13:02:35] Where in theory you didn't really upload anything.
[13:02:46] (03CR) 10Urbanecm: "beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019694 (https://phabricator.wikimedia.org/T356169) (owner: 10Arthur taylor) [13:02:50] (03CR) 10Urbanecm: [C:03+2] Change mul deployment on beta to limited version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019694 (https://phabricator.wikimedia.org/T356169) (owner: 10Arthur taylor) [13:03:41] codders: your change should be on beta soon (it self-deploys once it is merged to the repo; it should be there in less than an hour) [13:03:45] (03Merged) 10jenkins-bot: Change mul deployment on beta to limited version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019694 (https://phabricator.wikimedia.org/T356169) (owner: 10Arthur taylor) [13:03:49] perfect - thank you! [13:03:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P60503 and previous config saved to /var/cache/conftool/dbconfig/20240415-130355-marostegui.json [13:04:06] it doesn't seem like NMW03 is here [13:04:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) (owner: 10NMW03) [13:04:39] did not mean to do that [13:05:09] Should be able to remove the +2 and exit scap backport I think? [13:05:18] yeah, i did that [13:05:34] (03CR) 10Btullis: [V:03+1 C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007933 (https://phabricator.wikimedia.org/T357496) (owner: 10Gehel) [13:05:59] but it takes ~5 secs to find where that button is in new gerrit :D [13:06:02] anway, done [13:06:02] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362107#9713223 (10Gehel) p:05Triage→03Medium [13:06:04] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1014526|Add wgAutoCreateTempUser configuration for production (T349506 T337090)]], [[gerrit:1019694|Change mul deployment on beta to limited version (T356169)]] [13:06:04] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362106#9713225 (10Gehel) p:05Triage→03Medium [13:06:13] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [13:06:14] T337090: Disallow certain numbers from being generated in the temporary account creation process - https://phabricator.wikimedia.org/T337090 [13:06:14] T356169: MUL - Phased rollout on Wikidata.org (Stage 1 of 3: Test release) - https://phabricator.wikimedia.org/T356169 [13:06:14] urbanecm: NMW03 also just joined here ^^ [13:06:23] oh, hello NMW03! [13:06:25] hi [13:06:49] i'll get to your patch soon [13:06:53] 13:06:37 Build of K8s images failed (non-K8s deployment will continue normally) [13:06:55] does not look good [13:06:56] (03PS2) 10Gehel: query_service: refactoring 'query_service::monitor::updater' [puppet] - 10https://gerrit.wikimedia.org/r/1007933 (https://phabricator.wikimedia.org/T357496) [13:07:05] ouch [13:07:05] thanks [13:07:24] it...can't find mirrors.wikimedia.org? what? [13:07:57] oh, there was some discussion about that earlier… I’m not actually sure if it was resolved or not? 
[13:07:59] related to https://phabricator.wikimedia.org/T362518 [13:08:01] !log urbanecm@deploy1002 urbanecm and arthurtaylor and dreamyjazz: Backport for [[gerrit:1014526|Add wgAutoCreateTempUser configuration for production (T349506 T337090)]], [[gerrit:1019694|Change mul deployment on beta to limited version (T356169)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:23] oh, thanks. looking at the task [13:08:36] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-esams and not P{cp3066.esams.wmnet,cp3069.esams.wmnet,cp3070.esams.wmnet,cp3071.esams.wmnet,cp3072.esams.wmnet,cp3073.esams.wmnet} and A:cp [13:08:37] claime: ^ [13:08:49] did the buster base image rebuild finish already? [13:10:14] * urbanecm is waiting with deployment for someone to take a look at the scap issue [13:10:46] I’m guessing “Do a full rebuild deployment of mw-on-k8s” (in that task) would take care of k8s eventually™ [13:11:01] but it’s probably not good to have a new uploaders group configured only on bare metal (30% of external requests) until that happens [13:11:10] exactly [13:11:32] whereas Dreamy_Jazz’s change sounds more like it would be okay to sync (but also, not urgent, I guess, so might as well wait) [13:11:37] considering where k8s rollout is at, scap should just fail hard when it can't build k8s [13:11:54] My change should be a no-op, so in theory nothing should break if it was half-deployed [13:12:06] Lucas_WMDE: (re "Do a full rebuild deployment of mw-on-k8s"...) yes, but I suspect there's some intermediary image that needs to be rebuilt first [13:12:08] But it isn't urgent, so it could wait till tomorrow.
[13:13:52] Lucas_WMDE: are you talking about my patch [13:13:57] (03CR) 10Dreamy Jazz: [C:03+1] Schedule weekly purge of global_block_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: 10Tchanders) [13:14:07] NMW03: no, we're facing an issue with deployment generally. please hang on. [13:14:09] NMW03: yes, I’m afraid it might not be deployable right now due to infrastructure issues :/ [13:14:15] but nothing wrong with your patch in particular [13:14:20] +1 [13:14:23] thank you, no problem [13:14:28] Anyone available for https://gerrit.wikimedia.org/r/1015551 ? [13:14:40] so it seems like the /buster base image was updated [13:15:00] Deployments are on hold for the time being Winston_Sung. [13:15:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60504 and previous config saved to /var/cache/conftool/dbconfig/20240415-131510-arnaudb.json [13:16:10] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9713291 (10ssingh) >>! In T350179#9711211, @Papaul wrote: > @ssingh one thing that I found between the server NiC and the switch interfa... 
[13:16:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cp1115.eqiad.wmnet [13:17:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:17:26] * taavi tries to figure out how to `docker-pkg update` [13:17:30] (03PS1) 10Fabfur: benthos/haproxy: enable Benthos logging on all ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1019765 (https://phabricator.wikimedia.org/T358109) [13:18:10] taavi: i'm waiting for you/SRE for now, please ping me if i should abort and reschedule stuff for later. [13:19:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P60505 and previous config saved to /var/cache/conftool/dbconfig/20240415-131902-marostegui.json [13:19:20] !log upgraed spicerack to v8.5.0 on cumin2002 [13:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:42] Oh, on second thoughts my change may not be a no-op if beta wikis partly run on k8s. [13:20:02] wdym? [13:20:16] The offset value being set for bare metal vs not being set for k8s [13:20:23] beta doesn't run on k8s [13:20:30] Okay. [13:20:36] Didn't know if it did or not. [13:20:44] In which case, ignore that :) [13:20:54] @Dreamy_Jazz: So by "on hold for the time being," does it means there will be no more config deployments in this window or might be available later in this window? [13:20:56] but I suggest we wait for the UBN to be resolved [13:21:27] I'd probably say no deployments in this window unless https://phabricator.wikimedia.org/T362518 gets solved soon. [13:21:43] OK. Thanks. [13:21:53] Winston_Sung: it's fairly unlikely we will have enough time for something new. 
please use the calendar (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1300) to schedule deployments if possible, that'd make it easier to run windows. thanks! [13:26:40] as an update, I see that most images have been succesfully rebuilt and pushed already, so we are close. There are some errors for some images but unrelated to train stuff [13:27:50] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019765 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:28:40] 10ops-eqiad, 06SRE: Inbound interface errors - https://phabricator.wikimedia.org/T362366#9713344 (10phaultfinder) [13:29:10] (03CR) 10Ssingh: [C:03+1] "PCC looks good on text and upload!" [puppet] - 10https://gerrit.wikimedia.org/r/1019765 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:29:39] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9713348 (10Volans) With the above patch I think the issue should be solve... [13:30:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60506 and previous config saved to /var/cache/conftool/dbconfig/20240415-133016-arnaudb.json [13:30:39] Hello, I see deployments are on hold — T362530 is likely(?) solvable by `FlowFixInconsistentBoards`, would we prefer waiting until that UBN is resolved, or would a maintenance script be ok to run? [13:30:40] T362530: Fatal exception of type "Flow\Exception\InvalidDataException": The Structured Discussions workflow is not associated with this page. - https://phabricator.wikimedia.org/T362530 [13:30:59] TheresNoTime: wait it out, we are close. 
[13:31:04] ack [13:31:33] (03CR) 10Btullis: [C:03+1] "Looks good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019729 (https://phabricator.wikimedia.org/T361688) (owner: 10Stevemunene) [13:31:48] or at least we hope so, we got some issues figuring out exactly where in the process we are, thanks to buffered logs (apparently) [13:33:08] (03CR) 10Btullis: [C:03+2] analytics: refinery: add webrequest_frontend timer [puppet] - 10https://gerrit.wikimedia.org/r/1017041 (https://phabricator.wikimedia.org/T314956) (owner: 10Gmodena) [13:34:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T356166)', diff saved to https://phabricator.wikimedia.org/P60507 and previous config saved to /var/cache/conftool/dbconfig/20240415-133410-marostegui.json [13:34:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [13:34:17] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [13:34:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [13:34:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T356166)', diff saved to https://phabricator.wikimedia.org/P60508 and previous config saved to /var/cache/conftool/dbconfig/20240415-133433-marostegui.json [13:35:27] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9713365 (10MoritzMuehlenhoff) [13:37:07] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-esams and not P{cp3066.esams.wmnet,cp3069.esams.wmnet,cp3070.esams.wmnet,cp3071.esams.wmnet,cp3072.esams.wmnet,cp3073.esams.wmnet} and A:cp [13:37:37] (03CR) 10Vgutierrez: [C:03+1] benthos/haproxy: enable Benthos 
logging on all ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1019765 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:38:57] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9713370 (10fnegri) Works for me! 🎉 ` spicerack (master) $ docker run --r... [13:40:06] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9713373 (10fnegri) `pip install wikimedia-spicerack` is also working fine... [13:42:23] (03PS1) 10JMeybohm: Remove the tiller image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1019809 (https://phabricator.wikimedia.org/T251305) [13:43:34] (03PS1) 10Ayounsi: Puppet: add magru [puppet] - 10https://gerrit.wikimedia.org/r/1019810 [13:44:22] (03PS2) 10Ayounsi: Puppet: add magru [puppet] - 10https://gerrit.wikimedia.org/r/1019810 [13:44:37] (03CR) 10Effie Mouzeli: [C:03+1] Remove the tiller image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1019809 (https://phabricator.wikimedia.org/T251305) (owner: 10JMeybohm) [13:45:10] !log update thirdparty/haproxy28 to 2.8.9 for bullseye-wikimedia (apt.wm.o) [13:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:19] (03PS1) 10Klausman: /home/klausman: increase tmux scrollback/history [puppet] - 10https://gerrit.wikimedia.org/r/1019811 [13:45:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60509 and previous config saved to /var/cache/conftool/dbconfig/20240415-134522-arnaudb.json [13:45:41] (03CR) 10Majavah: Puppet: add magru (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [13:45:51] (03CR) 10Bking: sre.hosts.decommission: ask on failure (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: 10Volans) [13:46:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2111.codfw.wmnet with reason: reboot multiinstance replica [13:46:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2111.codfw.wmnet with reason: reboot multiinstance replica [13:46:45] (03CR) 10Klausman: [C:03+2] /home/klausman: increase tmux scrollback/history [puppet] - 10https://gerrit.wikimedia.org/r/1019811 (owner: 10Klausman) [13:47:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 depool', diff saved to https://phabricator.wikimedia.org/P60510 and previous config saved to /var/cache/conftool/dbconfig/20240415-134710-arnaudb.json [13:47:19] (03PS1) 10Vgutierrez: hiera: Move from HAProxy 2.7 to HAProxy 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1019812 [13:48:31] (03CR) 10Volans: sre.hosts.decommission: ask on failure (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: 10Volans) [13:48:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2111.codfw.wmnet with OS bookworm [13:51:30] TheresNoTime: the base image rebuilds are done (so mw should build again) [13:51:47] 06SRE, 06Content-Transform-Team-WIP, 10MW-on-K8s, 06serviceops, and 4 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#9713422 (10MSantos) [13:51:58] I've not read the whole scrollback here, but it seemed from the recent lines you're waiting to deploy anyways? 
[13:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (16) wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:52:31] jayme: I am only waiting to run a maintenance script, some others may be waiting to deploy though [13:52:32] (03CR) 10JMeybohm: [V:03+2 C:03+2] Remove the tiller image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1019809 (https://phabricator.wikimedia.org/T251305) (owner: 10JMeybohm) [13:52:47] TheresNoTime: ah, okay. Got it [13:53:16] jouncebot: nowandnext [13:53:16] For the next 0 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1300) [13:53:17] In 1 hour(s) and 36 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1530) [13:53:41] "one time" [13:54:03] *on ... 
should maybe take a break [13:54:51] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1006.eqiad.wmnet with OS bullseye [13:54:53] !log `[samtar@mwmaint1002 ~]$ mwscript extensions/Flow/maintenance/FlowFixInconsistentBoards.php --wiki=zhwiki --namespaceName User_talk` T362530 [13:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:59] T362530: Fatal exception of type "Flow\Exception\InvalidDataException": The Structured Discussions workflow is not associated with this page. - https://phabricator.wikimedia.org/T362530 [13:55:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9713433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye [13:55:52] TheresNoTime: would you be so nice to ping me when that's done? [13:56:03] jayme: done :) [13:56:07] lol, thanks :D [13:56:28] urbanecm: Did you cancel the deployment? [13:56:51] (03PS2) 10Fabfur: haproxy: increase various headers max length [puppet] - 10https://gerrit.wikimedia.org/r/1019760 (https://phabricator.wikimedia.org/T360415) [13:57:06] It seems that deployments can now continue, but we don't have much time left for the window. 
[13:57:54] not yet [13:58:08] jouncebot: nowandnext [13:58:08] For the next 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1300) [13:58:08] In 1 hour(s) and 31 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1530) [13:58:16] !log urbanecm@deploy1002 sync-world aborted: Backport for [[gerrit:1014526|Add wgAutoCreateTempUser configuration for production (T349506 T337090)]], [[gerrit:1019694|Change mul deployment on beta to limited version (T356169)]] (duration: 52m 11s) [13:58:19] cancelled now [13:58:25] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [13:58:26] (03CR) 10Fabfur: [C:03+1] "Looks good, let me know when this will be applied so I'll work on Benthos later on those hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1019812 (owner: 10Vgutierrez) [13:58:26] T337090: Disallow certain numbers from being generated in the temporary account creation process - https://phabricator.wikimedia.org/T337090 [13:58:26] T356169: MUL - Phased rollout on Wikidata.org (Stage 1 of 3: Test release) - https://phabricator.wikimedia.org/T356169 [13:58:35] jayme: should we redeploy? [13:59:05] urbanecm: that would be nice. I'm not exactly sure but AIUI code was deployed to metal but not k8s [13:59:29] !log update dbprov2005 dbbackups password T362509 [13:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:34] T362509: Setup new dbprov hosts and decommission the old ones - https://phabricator.wikimedia.org/T362509 [13:59:50] urbanecm: k8s image build should now be fixed, so a deploy to only k8s might be enough (with -D full_image_build:true ) [14:00:09] I can do that as well ofc. ... [14:00:59] jayme: it wasn't, i waited on the test stage [14:01:05] let me rerun it in full [14:01:13] ah, nice. Thanks!
[14:01:23] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1014526|Add wgAutoCreateTempUser configuration for production (T349506 T337090)]], [[gerrit:1019694|Change mul deployment on beta to limited version (T356169)]] [14:04:08] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster relforge: T361647 - bking@cumin2002 [14:04:09] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster relforge: T361647 - bking@cumin2002 [14:04:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2111.codfw.wmnet with reason: host reimage [14:04:15] T361647: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647 [14:04:22] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: T361647 - bking@cumin2002 [14:06:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2111.codfw.wmnet with reason: host reimage [14:09:19] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: 14spicerack: tox fails to install PyYAML using python 3.11 on bookworm - 14https://phabricator.wikimedia.org/T345337#9713473 (10Volans) 05Stalled→03Resolved 14Resolving then, th... 
[14:09:23] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533 (10FNavas-foundation) 03NEW [14:09:57] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: T361647 - bking@cumin2002 [14:10:02] T361647: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647 [14:10:09] (03CR) 10Stevemunene: [C:03+2] Upgrading datahub to v0.12.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019729 (https://phabricator.wikimedia.org/T361688) (owner: 10Stevemunene) [14:11:00] (03PS1) 10Jcrespo: mariadb: Add dbprov2005 to the grants for s*, x* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1019815 (https://phabricator.wikimedia.org/T362509) [14:11:02] (03PS1) 10Jcrespo: installserver: Setup db and dbprov hosts back to reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1019816 (https://phabricator.wikimedia.org/T355422) [14:11:08] (03Merged) 10jenkins-bot: Upgrading datahub to v0.12.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019729 (https://phabricator.wikimedia.org/T361688) (owner: 10Stevemunene) [14:12:33] (03CR) 10Gmodena: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019760 (https://phabricator.wikimedia.org/T360415) (owner: 10Fabfur) [14:13:34] !log uploaded tcp-mss-clamper 0.4+deb11u2 to bullseye-wikimedia (apt.wm.o) [14:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:07] (03CR) 10Elukey: [C:03+2] role::cassandra_dev: move to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1019272 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [14:14:15] !log urbanecm@deploy1002 urbanecm and dreamyjazz and arthurtaylor: Backport for [[gerrit:1014526|Add wgAutoCreateTempUser configuration for production (T349506 T337090)]], [[gerrit:1019694|Change mul deployment on beta to limited version (T356169)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:14:31] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [14:14:32] T337090: Disallow certain numbers from being generated in the temporary account creation process - https://phabricator.wikimedia.org/T337090 [14:14:32] T356169: MUL - Phased rollout on Wikidata.org (Stage 1 of 3: Test release) - https://phabricator.wikimedia.org/T356169 [14:14:34] I can test. 
[14:14:40] please go ahead [14:15:36] (03CR) 10Jcrespo: [C:04-1] "This is blocked on me until https://phabricator.wikimedia.org/T355353#9709152 is solved, but please check that the change for core dbs is " [puppet] - 10https://gerrit.wikimedia.org/r/1019816 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [14:16:04] (03CR) 10Vgutierrez: [C:03+2] hiera: Move from HAProxy 2.7 to HAProxy 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1019812 (owner: 10Vgutierrez) [14:16:25] !log move cassandra instances on cassandra-dev to pki - T352647 [14:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:32] T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647 [14:17:10] (03CR) 10Hnowlan: [C:04-1] beta::mediawiki_packages: Install lilypond from component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [14:17:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T356166)', diff saved to https://phabricator.wikimedia.org/P60511 and previous config saved to /var/cache/conftool/dbconfig/20240415-141710-marostegui.json [14:17:15] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [14:17:58] My change doesn't seem to be working. [14:18:19] Actually, can I use mwdebug servers on beta wikis? [14:18:26] no [14:18:29] Ah. I see. 
[14:18:30] not implemented there [14:18:32] !log urbanecm@deploy1002 urbanecm and dreamyjazz and arthurtaylor: Continuing with sync [14:18:34] proceeding [14:18:51] In which case there is nothing I can do to test this :) [14:20:59] The WikimediaDebug extension could probably make it clearer that it doesn't work on betawiki [14:21:26] The "unsupported domain" banner is not displayed for betawikis [14:21:36] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4051.ulsfo.wmnet,cp5030.eqsin.wmnet,cp5032.eqsin.wmnet} and A:cp [14:21:44] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533#9713578 (10ssingh) Hi @FNavas-foundation: @Aitolkyn already should already have access to Superset as they are part of the `analytics_privatedata_users` group. @Aitolkyn, can you please confirm? Th... [14:22:13] Oh. I've realised that there is a bug in that change [14:22:16] Dreamy_Jazz: define support. All wikimediadebug features are, to my knowledge, implemented and supported in beta. That's very much intentional. [14:23:10] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:24:38] urbanecm: The bug only affects betawikis, but means I will want to deploy a fix for that shortly. [14:24:48] I can do that myself. [14:26:27] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533#9713602 (10Aitolkyn) @ssingh Thank you for checking! I get the following error when trying to access my tables: ` mysql error: SELECT command denied to user 'research'@'10.67.30.187' for table `a... 
[14:26:51] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533#9713603 (10FNavas-foundation) Unsure why @Aitolkyn isn't have access now then :// As for the other stuff — expiry_date: "2024-09-01" expiry_contact: fnavas@wikimedia.org [14:27:21] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9713590 (10ayounsi) p:05Triage→03High a:03ayounsi [14:27:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2111.codfw.wmnet with OS bookworm [14:30:09] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4051.ulsfo.wmnet,cp5030.eqsin.wmnet,cp5032.eqsin.wmnet} and A:cp [14:31:04] (03PS1) 10Dreamy Jazz: Define 'useYear' as true for temp user serial mapping config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019821 (https://phabricator.wikimedia.org/T349506) [14:31:23] (03PS1) 10Dreamrimmer: Enable 'flood' user group at en.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250) [14:31:32] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:31:35] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1014526|Add wgAutoCreateTempUser configuration for production (T349506 T337090)]], [[gerrit:1019694|Change mul deployment on beta to limited version (T356169)]] (duration: 30m 12s) [14:31:44] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [14:31:44] T337090: Disallow certain numbers from being generated in the temporary account creation process - https://phabricator.wikimedia.org/T337090 [14:31:45] T356169: MUL - Phased rollout on Wikidata.org (Stage 1 of 3: Test release) - https://phabricator.wikimedia.org/T356169 
[14:31:52] jouncebot: nowandnext [14:31:52] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [14:31:52] In 0 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1530) [14:32:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P60512 and previous config saved to /var/cache/conftool/dbconfig/20240415-143217-marostegui.json [14:32:56] Going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1019821 now as it fixes a bug with the previous deploy [14:33:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019821 (https://phabricator.wikimedia.org/T349506) (owner: 10Dreamy Jazz) [14:33:51] (03Merged) 10jenkins-bot: Define 'useYear' as true for temp user serial mapping config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019821 (https://phabricator.wikimedia.org/T349506) (owner: 10Dreamy Jazz) [14:34:09] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1019821|Define 'useYear' as true for temp user serial mapping config (T349506)]] [14:35:11] (03CR) 10Fabfur: [C:03+2] haproxy: increase various headers max length [puppet] - 10https://gerrit.wikimedia.org/r/1019760 (https://phabricator.wikimedia.org/T360415) (owner: 10Fabfur) [14:35:17] (03CR) 10Vgutierrez: haproxy: increase various headers max length (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019760 (https://phabricator.wikimedia.org/T360415) (owner: 10Fabfur) [14:35:30] The bug that this fixes currently causes all new temporary accounts to be treated as registered users, as the usernames do not start with '~2' (because the year is not included at the start). This only affects beta as temporary accounts are not enabled on production, but needs fixing. 
[14:36:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60513 and previous config saved to /var/cache/conftool/dbconfig/20240415-143614-arnaudb.json [14:36:32] (03CR) 10Fabfur: [V:03+1 C:03+2] benthos/haproxy: enable Benthos logging on all ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1019765 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:36:40] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1019821|Define 'useYear' as true for temp user serial mapping config (T349506)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:44] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [14:37:26] (03PS1) 10Hnowlan: mw-jobrunner: set more php-specific settings to match metal instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019823 (https://phabricator.wikimedia.org/T358308) [14:37:51] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [14:38:29] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:39] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533#9713681 (10ssingh) >>! In T362533#9713602, @Aitolkyn wrote: > @ssingh Thank you for checking! I get the following error when trying to access my tables: > > > ` > mysql error: SELECT command deni... 
[14:39:49] (03PS1) 10Ssingh: admin: update expiry_{contact,date} for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/1019824 (https://phabricator.wikimedia.org/T362533) [14:40:28] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9713684 (10cmooney) p:05Triage→03Low @papaul yeah I think if we want to go this route we can just set them up the same as w... [14:41:51] !log fixed grants for db2098 [14:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:06] (03CR) 10Clément Goubert: [C:03+1] mw-jobrunner: set more php-specific settings to match metal instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019823 (https://phabricator.wikimedia.org/T358308) (owner: 10Hnowlan) [14:42:14] (03CR) 10Fabfur: [C:03+2] haproxy: increase various headers max length (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019760 (https://phabricator.wikimedia.org/T360415) (owner: 10Fabfur) [14:42:52] 06SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685#9713700 (10colewhite) a:05colewhite→03None [14:44:51] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/output/1019810/1912/" [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [14:47:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P60514 and previous config saved to /var/cache/conftool/dbconfig/20240415-144725-marostegui.json [14:48:39] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T361647 - bking@cumin2002 [14:48:46] T361647: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - 
https://phabricator.wikimedia.org/T361647 [14:49:09] (03PS1) 10Ssingh: wmf-config: add private subnets for magru [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) [14:49:52] (03CR) 10CI reject: [V:04-1] wmf-config: add private subnets for magru [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:50:26] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1019821|Define 'useYear' as true for temp user serial mapping config (T349506)]] (duration: 16m 16s) [14:50:31] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [14:50:48] (03CR) 10Muehlenhoff: beta::mediawiki_packages: Install lilypond from component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [14:50:49] (03PS2) 10Ssingh: wmf-config: add private subnets for magru [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) [14:51:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60515 and previous config saved to /var/cache/conftool/dbconfig/20240415-145120-arnaudb.json [14:51:32] moritzm: It would seem we need lilypond-fonts as well [14:51:44] nevermind, hnowlan is on it [14:51:52] NMW03: Can you re-schedule your change? [14:52:04] (actually seems that they are no longer in the channel) [14:52:38] BTW the bug on betawikis related to temporary accounts is fixed. 
[14:52:49] !log Afternoon backport window done [14:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:17] (03PS3) 10Muehlenhoff: beta::mediawiki_packages: Install lilypond from component [puppet] - 10https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) [14:53:46] (03CR) 10CI reject: [V:04-1] beta::mediawiki_packages: Install lilypond from component [puppet] - 10https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [14:58:28] (03PS4) 10Muehlenhoff: beta::mediawiki_packages: Install lilypond from component [puppet] - 10https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) [14:58:29] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:58] (03CR) 10Ssingh: Puppet: add magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [15:00:20] (03CR) 10Filippo Giunchedi: "o11y bits LGTM, see inline for an addition to alertmanager.yml.erb" [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [15:00:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1019824 (https://phabricator.wikimedia.org/T362533) (owner: 10Ssingh) [15:02:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T356166)', diff saved to https://phabricator.wikimedia.org/P60516 and previous config saved to /var/cache/conftool/dbconfig/20240415-150235-marostegui.json [15:02:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:02:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet 
with reason: Maintenance [15:02:51] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [15:02:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T356166)', diff saved to https://phabricator.wikimedia.org/P60517 and previous config saved to /var/cache/conftool/dbconfig/20240415-150257-marostegui.json [15:03:31] Hi there, and hi Amir1 in particular. Can I go ahead with the query in https://phabricator.wikimedia.org/T362365#9710047 ? [15:03:42] !log dancy@deploy1002 Installing scap version "4.76.0" for 340 hosts [15:03:47] Daimona: sure [15:04:00] Thanks! [15:04:21] !log Running query for T362365#9710047 [15:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:33] !log dancy@deploy1002 Installation of scap version "4.76.0" completed for 340 hosts [15:04:33] T362365: Event registration should not be disabled after marking the event page for translation - https://phabricator.wikimedia.org/T362365 [15:06:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60518 and previous config saved to /var/cache/conftool/dbconfig/20240415-150626-arnaudb.json [15:07:04] (03CR) 10Majavah: "Looks good. 
Let me know if you need any help deploying this :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [15:07:09] (03CR) 10Majavah: [C:03+1] wmf-config: add private subnets for magru [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [15:10:23] (03PS1) 10Peter Fischer: Search update pipeline: update config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019826 [15:10:39] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: update config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019826 (owner: 10Peter Fischer) [15:11:49] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T361647 - bking@cumin2002 [15:11:53] (03PS4) 10Jdlrobson: Enable desktop watchlist on beta cluster, clean up old references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016022 (https://phabricator.wikimedia.org/T109277) [15:11:54] T361647: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647 [15:11:56] (03PS2) 10Pppery: Update links to point to non-wiki privacy policy and bypass redirects [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129) [15:12:25] (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:39] (03CR) 10Pppery: Update links to point to non-wiki privacy policy and bypass redirects (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129) (owner: 
10Pppery) [15:12:55] 06SRE, 10Observability-Logging, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): 14Enable SSO for Kibana - 14https://phabricator.wikimedia.org/T246998#9713868 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi 14I'm optimistically resolving this since logstash.w.o (nowadays opensearch d... [15:13:05] (03Merged) 10jenkins-bot: Search update pipeline: update config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019826 (owner: 10Peter Fischer) [15:13:16] (03PS3) 10Ayounsi: Puppet: add magru [puppet] - 10https://gerrit.wikimedia.org/r/1019810 [15:13:29] (JobUnavailable) firing: (4) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:37] (03CR) 10Ayounsi: Puppet: add magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [15:14:26] (03CR) 10Muehlenhoff: New SSH key validator - Block duplicate keys. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1019271 (https://phabricator.wikimedia.org/T359532) (owner: 10Slyngshede) [15:14:42] 06SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 2 others: 14MediaWiki Prometheus support - 14https://phabricator.wikimedia.org/T240685#9713872 (10colewhite) 05Open→03Resolved a:03colewhite 14This epic will continue in {T343020} [15:15:53] (03CR) 10Ayounsi: wmf-config: add private subnets for magru (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [15:15:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1006.eqiad.wmnet with OS bullseye [15:16:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9713887 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed w... [15:16:29] (03CR) 10Ssingh: wmf-config: add private subnets for magru (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [15:18:42] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:19:03] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:19:47] (03CR) 10Ayounsi: wmf-config: add private subnets for magru (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [15:20:06] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 3 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. 
Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#9713923 (10MSantos) [15:21:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60519 and previous config saved to /var/cache/conftool/dbconfig/20240415-152132-arnaudb.json [15:22:22] (03PS1) 10Eevans: sessionstore: test TLS verification (staging) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019827 (https://phabricator.wikimedia.org/T352647) [15:24:35] (03Abandoned) 10DCausse: cirrus-streaming-updater: swith to "failure-rate" retry strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018778 (owner: 10DCausse) [15:24:41] (03PS3) 10Ssingh: wmf-config: add private subnets for magru [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) [15:26:10] (03CR) 10Muehlenhoff: [C:03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1019392 (owner: 10Majavah) [15:26:23] (03CR) 10Majavah: [C:03+2] P:sre: os-updates: fix current OS version names [puppet] - 10https://gerrit.wikimedia.org/r/1019392 (owner: 10Majavah) [15:27:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1019290 (https://phabricator.wikimedia.org/T360636) (owner: 10Hnowlan) [15:27:33] (03PS1) 10Filippo Giunchedi: alertmanager: tweak irc alert message format [puppet] - 10https://gerrit.wikimedia.org/r/1019829 (https://phabricator.wikimedia.org/T362239) [15:27:35] (03CR) 10CI reject: [V:04-1] wmf-config: add private subnets for magru [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [15:28:26] (03PS4) 10Ssingh: wmf-config: add private subnets for magru [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) [15:29:35] (03CR) 10Elukey: [C:03+1] sessionstore: test TLS verification (staging) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1019827 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [15:30:05] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1530). [15:36:13] (03CR) 10Eevans: [C:03+2] sessionstore: test TLS verification (staging) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019827 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [15:37:25] (SystemdUnitFailed) resolved: elasticsearch-disable-readahead.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:25] (03Merged) 10jenkins-bot: sessionstore: test TLS verification (staging) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019827 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [15:38:29] (JobUnavailable) resolved: (3) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:18] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [15:40:23] 06SRE, 10SRE-tools: Create a spicerack cookbook for restoring an etcd cluster from backups - https://phabricator.wikimedia.org/T203944#9714078 (10Volans) [15:40:51] 06SRE, 10SRE-tools: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948#9714079 (10Volans) [15:42:38] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters - https://phabricator.wikimedia.org/T315560#9714081 (10Volans) 
@JMeybohm Is this something still needed? [15:43:07] 06SRE, 10SRE-tools: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943#9714082 (10Volans) [15:43:23] 06SRE, 10SRE-tools: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694#9714083 (10Volans) [15:44:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T356166)', diff saved to https://phabricator.wikimedia.org/P60520 and previous config saved to /var/cache/conftool/dbconfig/20240415-154422-marostegui.json [15:44:32] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [15:44:34] (03CR) 10Herron: [C:03+1] "Thanks, definitely an improvement IMO" [puppet] - 10https://gerrit.wikimedia.org/r/1019829 (https://phabricator.wikimedia.org/T362239) (owner: 10Filippo Giunchedi) [15:45:40] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:45:44] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:47:02] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533#9714116 (10Aitolkyn) >>! In T362533#9713681, @ssingh wrote: >>>! In T362533#9713602, @Aitolkyn wrote: >> @ssingh Thank you for checking! I get the following error when trying t... [15:47:26] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters - https://phabricator.wikimedia.org/T315560#9714127 (10JMeybohm) >>! In T315560#9714081, @Volans wrote: > @JMeybohm Is this something still needed? Not ult... 
[15:48:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:48:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:50:30] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [15:51:03] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:51:07] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:52:36] (03PS1) 10Andrew Bogott: profile::wmcs::backup_cinder_volumes: remove remove-unhandled-backups timer [puppet] - 10https://gerrit.wikimedia.org/r/1019832 [15:53:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:53:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:54:17] (03CR) 10Alexandros Kosiaris: [C:03+1] "+1 cause it's already an improvement." [puppet] - 10https://gerrit.wikimedia.org/r/1019829 (https://phabricator.wikimedia.org/T362239) (owner: 10Filippo Giunchedi) [15:55:16] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1001.eqiad.wmnet with OS bullseye [15:55:17] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ncredir1001.eqiad.wmnet with OS bullseye [15:56:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:56:09] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:57:23] (03CR) 10Andrew Bogott: [C:03+2] profile::wmcs::backup_cinder_volumes: remove remove-unhandled-backups timer [puppet] - 10https://gerrit.wikimedia.org/r/1019832 (owner: 10Andrew Bogott) [15:58:05] (03CR) 10Ssingh: Puppet: add magru (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [15:58:23] (03PS4) 10Jdrewniak: [beta] Set Vector 2022 font-size to 16px on beta [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1017345 (https://phabricator.wikimedia.org/T360098) [15:58:29] (03CR) 10Ssingh: "Thanks for the patch! Some minor comments in-line, feel free to ignore or defer to me." [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [15:58:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:58:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:58:39] (03CR) 10Jdlrobson: [C:03+1] "Jan: This can be deployed outside a backport window, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017345 (https://phabricator.wikimedia.org/T360098) (owner: 10Jdrewniak) [15:59:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P60521 and previous config saved to /var/cache/conftool/dbconfig/20240415-155932-marostegui.json [16:01:13] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:01:17] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:01:37] 10ops-eqiad, 06SRE: Inbound interface errors - https://phabricator.wikimedia.org/T362366#9714176 (10cmooney) [16:02:00] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1001.eqiad.wmnet with OS bullseye [16:03:16] (03CR) 10Ssingh: "Sorry I missed this but modules/rancid/files/core/router.db also needs an update?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [16:03:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:03:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:05:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:05:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:08:09] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:08:13] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:09:23] (03CR) 10Ayounsi: [C:03+1] "lgtm, thx!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [16:10:00] (03CR) 10Ssingh: "I think we should deploy as we go so unless you see a reason not to, please deploy it when you can." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [16:10:41] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T362550 (10phaultfinder) 03NEW [16:10:49] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:10:53] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:11:49] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage [16:12:36] (03PS1) 10Ebernhardson: cirrus: Enable saneitizer on consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019833 (https://phabricator.wikimedia.org/T358599) [16:13:29] (JobUnavailable) firing: (2) Reduced availability for job lvs_realserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:03] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:14:07] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:14:41] (03CR) 10SBassett: "Hey all - We'd like to get this merged in a week or so. Are there any other serious objections to what has been proposed and implemented " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [16:14:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P60522 and previous config saved to /var/cache/conftool/dbconfig/20240415-161441-marostegui.json [16:14:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage [16:15:01] (03CR) 10Ayounsi: "that will come in a following patch to add network devices to monitoring." [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [16:15:50] (03CR) 10Ssingh: [C:03+1] "Will follow up with the consistency patch on top of this." [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [16:17:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:17:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:19:52] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:19:56] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:21:32] (03CR) 10Majavah: "Shouldn't this be in a `.well-known` directory instead of being directly in the domain root?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [16:22:24] (03PS4) 10Dreamrimmer: [ruwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) [16:23:16] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T361647 - bking@cumin2002 [16:23:27] T361647: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647 [16:25:22] (03CR) 10Ssingh: [C:03+2] admin: update expiry_{contact,date} for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/1019824 (https://phabricator.wikimedia.org/T362533) (owner: 10Ssingh) [16:28:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:28:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:29:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T356166)', diff saved to https://phabricator.wikimedia.org/P60523 and previous config saved to /var/cache/conftool/dbconfig/20240415-162949-marostegui.json [16:29:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [16:29:54] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [16:30:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [16:30:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T356166)', diff saved to https://phabricator.wikimedia.org/P60524 and previous config saved to /var/cache/conftool/dbconfig/20240415-163011-marostegui.json [16:31:11] 10ops-codfw, 06SRE: 
14PowerSupplyFailure - 14https://phabricator.wikimedia.org/T362550#9714420 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm 14reseated blue cable [16:31:54] (03PS1) 10Herron: alertmanager: irc: move group name after summary and clarify count [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) [16:32:49] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:32:53] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:35:11] (03PS2) 10Herron: alertmanager: irc: move group name after summary and clarify count [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) [16:35:32] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2006-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9714430 (10Jhancock.wm) [16:35:33] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9714431 (10MoritzMuehlenhoff) [16:35:58] (03CR) 10SBassett: "The only thing we really have under .well-known under the projects, AFAICT, is change-password, which is handled as an Apache redirect: ht" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [16:38:14] 10ops-eqiad, 06SRE: Inbound interface errors - https://phabricator.wikimedia.org/T362366#9714444 (10cmooney) a:03cmooney [16:38:32] (03PS1) 10BCornwall: Set ncredir1001 to use nginx variant "light" [puppet] - 10https://gerrit.wikimedia.org/r/1019842 (https://phabricator.wikimedia.org/T357976) [16:39:16] (03PS1) 10Ssingh: realm: fix consistency for site IPs [puppet] - 10https://gerrit.wikimedia.org/r/1019843 [16:40:43] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" 
[puppet] - 10https://gerrit.wikimedia.org/r/1019842 (https://phabricator.wikimedia.org/T357976) (owner: 10BCornwall) [16:41:40] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2006-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9714481 (10Jhancock.wm) a:03Jhancock.wm @cmooney what is the vlan for this server? racked in : B1-U42 port: 41 [16:43:12] (03PS1) 10Herron: alertmanager: irc: remove runbook and dashboard links from irc alerts [puppet] - 10https://gerrit.wikimedia.org/r/1019844 (https://phabricator.wikimedia.org/T362239) [16:45:47] (03CR) 10Herron: [C:03+1] "FWIW following this up with some additional alert format proposals in I39617c18921c96034d823203d024ac2cb3aaae1e and Ifa71a170f94d786b05fd" [puppet] - 10https://gerrit.wikimedia.org/r/1019829 (https://phabricator.wikimedia.org/T362239) (owner: 10Filippo Giunchedi) [16:47:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:31] (03CR) 10Ssingh: [C:03+1] Set ncredir1001 to use nginx variant "light" [puppet] - 10https://gerrit.wikimedia.org/r/1019842 (https://phabricator.wikimedia.org/T357976) (owner: 10BCornwall) [16:48:54] (03PS3) 10Jcrespo: mariadb: Remove db2101 from services [puppet] - 10https://gerrit.wikimedia.org/r/1019689 (https://phabricator.wikimedia.org/T362311) [16:49:04] (03CR) 10Vgutierrez: [C:03+1] "looks good, maybe mention the variant=custom commit/change ID in the commit message (Ibb008fb49d0d84a61e71976648736cac1c89c66d)" [puppet] - 10https://gerrit.wikimedia.org/r/1019842 (https://phabricator.wikimedia.org/T357976) (owner: 10BCornwall) [16:50:17] (03PS1) 10Eevans: Revert "sessionstore: test TLS verification (staging)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019801 [16:50:35] (03CR) 
10Elukey: [C:03+1] Revert "sessionstore: test TLS verification (staging)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019801 (owner: 10Eevans) [16:51:18] (03CR) 10Eevans: [C:03+2] Revert "sessionstore: test TLS verification (staging)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019801 (owner: 10Eevans) [16:51:59] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9714556 (10Jhancock.wm) [16:52:21] (03Merged) 10jenkins-bot: Revert "sessionstore: test TLS verification (staging)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019801 (owner: 10Eevans) [16:53:26] (03CR) 10Jcrespo: [C:03+2] mariadb: Remove db2101 from services [puppet] - 10https://gerrit.wikimedia.org/r/1019689 (https://phabricator.wikimedia.org/T362311) (owner: 10Jcrespo) [16:57:19] !log jynus@cumin2002 START - Cookbook sre.hosts.decommission for hosts db2101.codfw.wmnet [16:57:28] (03PS2) 10BCornwall: Set ncredir1001 to use nginx variant "light" [puppet] - 10https://gerrit.wikimedia.org/r/1019842 (https://phabricator.wikimedia.org/T357976) [16:57:48] (03CR) 10Ayounsi: [C:03+1] realm: fix consistency for site IPs [puppet] - 10https://gerrit.wikimedia.org/r/1019843 (owner: 10Ssingh) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1700) [17:00:05] ryankemper: gettimeofday() says it's time for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T1700) [17:00:52] is anyone planning to use the MW infra window? 
if not, I'll deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1019825 [17:02:06] (03CR) 10Dzahn: [C:03+2] delete cas-logtash.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1019086 (owner: 10Dzahn) [17:02:10] (03PS2) 10Dzahn: delete cas-logtash.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1019086 [17:03:05] !log jynus@cumin2002 START - Cookbook sre.dns.netbox [17:03:22] (03PS1) 10Jcrespo: mariadb: Fully remove db2101 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1019849 (https://phabricator.wikimedia.org/T362311) [17:03:58] (03PS2) 10Jcrespo: mariadb: Fully remove db2101 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1019849 (https://phabricator.wikimedia.org/T362311) [17:04:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [17:05:22] !log jynus@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2101.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin2002" [17:05:46] (03Merged) 10jenkins-bot: wmf-config: add private subnets for magru [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019825 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [17:06:03] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1019825|wmf-config: add private subnets for magru (T346722)]] [17:06:32] !log jynus@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2101.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin2002" [17:06:32] !log jynus@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:06:33] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2101.codfw.wmnet 
[17:06:46] (03CR) 10Jcrespo: [C:03+2] mariadb: Fully remove db2101 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1019849 (https://phabricator.wikimedia.org/T362311) (owner: 10Jcrespo) [17:08:24] ^ mutante [17:08:32] !log taavi@deploy1002 taavi and sukhe: Backport for [[gerrit:1019825|wmf-config: add private subnets for magru (T346722)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:08:51] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9714676 (10Papaul) @Jhancock.wm cloud-hosts1-b1-codfw (2118) [17:09:08] (not the ticket, the log) [17:10:51] !log taavi@deploy1002 taavi and sukhe: Continuing with sync [17:13:53] !log stop db2139 dbs for upgrade T360751 [17:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:58] T360751: Upgrade backup sources to MariaDB 10.6 - https://phabricator.wikimedia.org/T360751 [17:14:35] (03PS3) 10Jcrespo: mariadb: Upgrade db2139 to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1019690 (https://phabricator.wikimedia.org/T360751) [17:17:43] (03CR) 10BCornwall: [C:03+2] Set ncredir1001 to use nginx variant "light" [puppet] - 10https://gerrit.wikimedia.org/r/1019842 (https://phabricator.wikimedia.org/T357976) (owner: 10BCornwall) [17:17:53] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db2139 to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1019690 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [17:18:31] brett: conflict on puppet [17:19:19] jynus: Feel free to merge my puppet change when you're ready (yours looks more significant and perhaps needing timing) [17:19:52] doing [17:20:42] brett: finished [17:21:08] <3 [17:21:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:36] 10ops-eqiad, 06SRE, 06DC-Ops: 14eqiad: Master Tracking Ticket for eqiad expansion cage - 14https://phabricator.wikimedia.org/T296966#9714716 (10wiki_willy) 05Open→03Resolved 14Since the only thing remaining in this task is bringing up the Dell switches in racks E8 and F8 (which I believe the Network... [17:23:24] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1019825|wmf-config: add private subnets for magru (T346722)]] (duration: 17m 21s) [17:33:29] (JobUnavailable) resolved: (2) Reduced availability for job lvs_realserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:34:01] (03PS1) 10Alexandros Kosiaris: Rename jobrunners to videoscalers [alerts] - 10https://gerrit.wikimedia.org/r/1019852 [17:35:32] (03CR) 10CI reject: [V:04-1] Rename jobrunners to videoscalers [alerts] - 10https://gerrit.wikimedia.org/r/1019852 (owner: 10Alexandros Kosiaris) [17:38:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir1001.eqiad.wmnet with OS bullseye [17:42:54] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T361647 - bking@cumin2002 [17:42:59] T361647: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647 [17:43:57] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1002.eqiad.wmnet with OS bullseye [17:44:55] (03PS1) 10BCornwall: Set ncredir1002 to use nginx variant "light" [puppet] - 10https://gerrit.wikimedia.org/r/1019855 (https://phabricator.wikimedia.org/T357976) [17:45:44] (03CR) 10Ssingh: [C:03+1] Set ncredir1002 to 
use nginx variant "light" [puppet] - 10https://gerrit.wikimedia.org/r/1019855 (https://phabricator.wikimedia.org/T357976) (owner: 10BCornwall) [17:46:56] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019855 (https://phabricator.wikimedia.org/T357976) (owner: 10BCornwall) [17:47:48] (03CR) 10BCornwall: [V:03+1 C:03+2] Set ncredir1002 to use nginx variant "light" [puppet] - 10https://gerrit.wikimedia.org/r/1019855 (https://phabricator.wikimedia.org/T357976) (owner: 10BCornwall) [17:51:13] (03CR) 10Hnowlan: [C:03+1] "This has been done, safe to merge." [puppet] - 10https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) (owner: 10Muehlenhoff) [17:51:45] 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DBA, and 3 others: Decommission db2101 (was: db2101 crashed) - https://phabricator.wikimedia.org/T362311#9714887 (10jcrespo) [17:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (16) wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:52:39] 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DBA, and 3 others: Decommission db2101 (was: db2101 crashed) - https://phabricator.wikimedia.org/T362311#9714882 (10jcrespo) 
a:05jcrespo→03None CC @ABran-WMF in case I missed something. [17:53:35] (03PS1) 10Jdlrobson: Enable night mode on template namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019857 [17:54:09] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9714919 (10Scott_French) @WDoranWMF and @SGupta-WMF, thank you both for the followup. As for image... [17:56:41] (03PS2) 10Ebernhardson: cirrus: Enable saneitizer on consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019833 (https://phabricator.wikimedia.org/T358599) [17:58:06] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage [17:59:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1006.eqiad.wmnet with OS bullseye [18:00:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9714961 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye [18:00:11] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:00:15] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:02:11] (03PS2) 10Jcrespo: mariadb: Add dbprov2005 to the grants for s*, x* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1019815 (https://phabricator.wikimedia.org/T362509) [18:02:11] (03PS2) 10Jcrespo: installserver: Setup db and dbprov hosts back to reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1019816 (https://phabricator.wikimedia.org/T355422) [18:02:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage 
[18:04:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T356166)', diff saved to https://phabricator.wikimedia.org/P60526 and previous config saved to /var/cache/conftool/dbconfig/20240415-180422-marostegui.json [18:04:28] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:10:30] (03PS5) 10Jdlrobson: Enable desktop watchlist on beta cluster, clean up old references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016022 (https://phabricator.wikimedia.org/T109277) [18:12:42] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1006.eqiad.wmnet with reason: host reimage [18:13:52] (03PS1) 10Eevans: sessionstore (staging): enable use of PKI + verification [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019858 (https://phabricator.wikimedia.org/T352647) [18:15:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1006.eqiad.wmnet with reason: host reimage [18:17:44] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1019086 (owner: 10Dzahn) [18:18:15] 06SRE, 06Infrastructure-Foundations, 10netops: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772#9714985 (10cmooney) >>! In T360772#9657554, @ayounsi wrote: > We can define per host hiera keys, and empty lists as well, so to be tested but... 
[18:19:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P60527 and previous config saved to /var/cache/conftool/dbconfig/20240415-181930-marostegui.json [18:21:21] (03CR) 10Eevans: [C:03+2] sessionstore (staging): enable use of PKI + verification [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019858 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [18:21:35] (03PS1) 10Herron: mailman: switch HELO checks from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/1019861 (https://phabricator.wikimedia.org/T173338) [18:22:05] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir1002.eqiad.wmnet with OS bullseye [18:22:25] (03Merged) 10jenkins-bot: sessionstore (staging): enable use of PKI + verification [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019858 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [18:24:00] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [18:24:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.19% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:29:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.19% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:34:27] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [18:34:38] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P60528 and previous config saved to /var/cache/conftool/dbconfig/20240415-183437-marostegui.json [18:36:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1006.eqiad.wmnet with OS bullseye [18:36:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9715042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye completed:... [18:37:25] (03PS1) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [18:42:03] (03PS8) 10Winston Sung: zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015551 (https://phabricator.wikimedia.org/T361427) (owner: 10S8321414) [18:43:11] (03PS3) 10Gehel: query_service: refactoring 'query_service::monitor::updater' [puppet] - 10https://gerrit.wikimedia.org/r/1007933 (https://phabricator.wikimedia.org/T357496) [18:45:05] (03PS1) 10Eevans: Revert "sessionstore (staging): enable use of PKI + verification" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019804 [18:45:28] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir1001.eqiad.wmnet,service=nginx [18:45:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir1002.eqiad.wmnet,service=nginx [18:46:55] (03CR) 10Gehel: [C:03+2] query_service: refactoring 'query_service::monitor::updater' [puppet] - 10https://gerrit.wikimedia.org/r/1007933 (https://phabricator.wikimedia.org/T357496) (owner: 10Gehel) [18:48:57] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:49:15] silencing the probes, this is expected [18:49:29] looki-- aha, thanks <3 [18:49:30] thanks sukhe [18:49:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T356166)', diff saved to https://phabricator.wikimedia.org/P60529 and previous config saved to /var/cache/conftool/dbconfig/20240415-184945-marostegui.json [18:49:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance [18:49:50] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:50:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance [18:50:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T356166)', diff saved to https://phabricator.wikimedia.org/P60530 and previous config saved to /var/cache/conftool/dbconfig/20240415-185008-marostegui.json [18:50:10] 07sre-alert-triage, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: 14Alert in need of triage: Updater process (instance wdqs1022) - 14https://phabricator.wikimedia.org/T357496#9715082 (10Gehel) 05In progress→03Resolved [18:50:52] resolved in VO [18:51:42] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir1001.eqiad.wmnet,service=nginx [18:51:54] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir1001.eqiad.wmnet,service=nginx [18:52:34] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:53:08] sorry for the noise folks! [18:53:26] no worries, thanks for jumping on it! [18:53:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T352010)', diff saved to https://phabricator.wikimedia.org/P60531 and previous config saved to /var/cache/conftool/dbconfig/20240415-185334-ladsgroup.json [18:53:39] lmk if you need anything? [18:53:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:53:43] (03PS3) 10Herron: alertmanager: irc: move group name after summary and clarify count [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) [18:54:15] (03PS2) 10Herron: alertmanager: irc: remove runbook and dashboard links from irc alerts [puppet] - 10https://gerrit.wikimedia.org/r/1019844 (https://phabricator.wikimedia.org/T362239) [18:55:10] (03PS1) 10CDanis: add python 3.11 [software/conftool] - 10https://gerrit.wikimedia.org/r/1019876 [18:57:08] (03CR) 10Eevans: [C:03+2] Revert "sessionstore (staging): enable use of PKI + verification" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019804 (owner: 10Eevans) [18:57:34] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:59:32] (03Merged) 10jenkins-bot: Revert "sessionstore (staging): enable use of PKI + verification" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019804 (owner: 10Eevans) [19:01:12] !log deleting unused cas-logstash.wikimedia.org from DNS [19:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:57] (03PS2) 10Dzahn: delete kibana-next.svc.[eqiad|codfw].wmnet records [dns] - 
10https://gerrit.wikimedia.org/r/1019087 [19:04:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:04:39] (03CR) 10Dzahn: "tested query and it's an empty set - expected?" [puppet] - 10https://gerrit.wikimedia.org/r/1019715 (https://phabricator.wikimedia.org/T197699) (owner: 10Aklapper) [19:05:58] (03PS1) 10Andrew Bogott: New files/templates for OpenStack Bobcat (2023.2) [puppet] - 10https://gerrit.wikimedia.org/r/1019879 (https://phabricator.wikimedia.org/T356287) [19:06:12] (03CR) 10Dzahn: [C:03+2] phabricator weekly changes email: List Diffusion repository renames [puppet] - 10https://gerrit.wikimedia.org/r/1019715 (https://phabricator.wikimedia.org/T197699) (owner: 10Aklapper) [19:06:39] (03PS2) 10Andrew Bogott: New files/templates for OpenStack Bobcat (2023.2) [puppet] - 10https://gerrit.wikimedia.org/r/1019879 (https://phabricator.wikimedia.org/T356287) [19:07:09] (03CR) 10CI reject: [V:04-1] add python 3.11 [software/conftool] - 10https://gerrit.wikimedia.org/r/1019876 (owner: 10CDanis) [19:07:32] (03CR) 10CI reject: [V:04-1] New files/templates for OpenStack Bobcat (2023.2) [puppet] - 10https://gerrit.wikimedia.org/r/1019879 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [19:08:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P60532 and previous config saved to /var/cache/conftool/dbconfig/20240415-190842-ladsgroup.json [19:12:41] !log deleting unused kibana-next.svc records from DNS - T234854 [19:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:47] T234854: Upgrade ELK Stack to version 7 - 
https://phabricator.wikimedia.org/T234854 [19:13:15] (03CR) 10Dzahn: [C:03+2] delete kibana-next.svc.[eqiad|codfw].wmnet records [dns] - 10https://gerrit.wikimedia.org/r/1019087 (owner: 10Dzahn) [19:14:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:21:30] (03PS2) 10CDanis: add python 3.11 [software/conftool] - 10https://gerrit.wikimedia.org/r/1019876 [19:21:30] (03PS1) 10CDanis: WIP black? [software/conftool] - 10https://gerrit.wikimedia.org/r/1019882 [19:23:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P60533 and previous config saved to /var/cache/conftool/dbconfig/20240415-192350-ladsgroup.json [19:37:14] (03PS3) 10Msz2001: Remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on Polish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019176 (https://phabricator.wikimedia.org/T362414) [19:38:30] (03PS1) 10Dzahn: graphite: avoid including multiple roles, define primary host in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1019885 [19:38:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T352010)', diff saved to https://phabricator.wikimedia.org/P60534 and previous config saved to /var/cache/conftool/dbconfig/20240415-193858-ladsgroup.json [19:39:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [19:39:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:39:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db1169.eqiad.wmnet with reason: Maintenance [19:39:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T352010)', diff saved to https://phabricator.wikimedia.org/P60535 and previous config saved to /var/cache/conftool/dbconfig/20240415-193921-ladsgroup.json [19:40:21] (03PS3) 10Ebernhardson: cirrus: Enable saneitizer on consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019833 (https://phabricator.wikimedia.org/T358599) [19:40:49] (03CR) 10Jdrewniak: [C:03+2] "merging this, assuming it'll be deployed with 15 on the beta cluster." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017345 (https://phabricator.wikimedia.org/T360098) (owner: 10Jdrewniak) [19:41:42] (03Merged) 10jenkins-bot: [beta] Set Vector 2022 font-size to 16px on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017345 (https://phabricator.wikimedia.org/T360098) (owner: 10Jdrewniak) [19:44:01] (03PS1) 10Dzahn: graphite: switch envoy ssl provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1019887 (https://phabricator.wikimedia.org/T360414) [19:44:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T356166)', diff saved to https://phabricator.wikimedia.org/P60536 and previous config saved to /var/cache/conftool/dbconfig/20240415-194420-marostegui.json [19:44:31] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [19:45:04] (03PS1) 10Dzahn: ssl: delete graphite.discovery.wmnet certificate [puppet] - 10https://gerrit.wikimedia.org/r/1019888 (https://phabricator.wikimedia.org/T360414) [19:45:44] (03PS1) 10Dzahn: delete graphite.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1019889 (https://phabricator.wikimedia.org/T360414) [19:46:50] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9715364 (10Jclark-ctr) 
Replaced dac cable and reimaged @jcrespo looks like it resolved issue [19:46:53] (03PS4) 10Ebernhardson: cirrus: Enable saneitizer on consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019833 (https://phabricator.wikimedia.org/T358599) [19:47:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: 14Q3:rack/setup/install dbprov100[56] - 14https://phabricator.wikimedia.org/T355353#9715365 (10Jclark-ctr) 05Open→03Resolved [19:47:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: 14Q3:rack/setup/install dbprov100[56] - 14https://phabricator.wikimedia.org/T355353#9715367 (10Jclark-ctr) a:05VRiley-WMF→03Jclark-ctr [19:53:48] (03CR) 10AOkoth: prometheus: puppetise sql_exporter (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [19:59:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P60537 and previous config saved to /var/cache/conftool/dbconfig/20240415-195928-marostegui.json [19:59:48] (03CR) 10Ebernhardson: [C:03+2] cirrus: Enable saneitizer on consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019833 (https://phabricator.wikimedia.org/T358599) (owner: 10Ebernhardson) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T2000). nyaa~ [20:00:05] Jdlrobson and Winston_Sung: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:45] o/ [20:02:26] (03Merged) 10jenkins-bot: cirrus: Enable saneitizer on consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019833 (https://phabricator.wikimedia.org/T358599) (owner: 10Ebernhardson) [20:02:56] :wave: [20:03:03] I might be able to deploy if no one else can [20:04:03] OK I can deploy! [20:05:28] Jdlrobson: we'll do yours first :D [20:05:43] !log starting UTC late backport window [20:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:23] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:06:35] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:07:19] kindrobot: thanks [20:12:19] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:12:25] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:13:30] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:13:34] Jdlrobson: can you confirm that this is the second change you wanted? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1019857 [20:13:34] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:13:49] Is different in the deployment queue log [20:13:50] kindrobot: correct [20:13:52] Great [20:14:01] they can go out together though - no big deal [20:14:10] Preparing to deploy, will deploy them together [20:14:11] (no ordering required) [20:14:24] 🫡 [20:14:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P60538 and previous config saved to /var/cache/conftool/dbconfig/20240415-201436-marostegui.json [20:16:02] (03PS1) 10MusikAnimal: [mediawikiwiki] enable CodeMirror V6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019893 (https://phabricator.wikimedia.org/T357795) [20:16:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016022 (https://phabricator.wikimedia.org/T109277) (owner: 10Jdlrobson) [20:16:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019857 (owner: 10Jdlrobson) [20:17:51] (03Merged) 10jenkins-bot: Enable desktop watchlist on beta cluster, clean up old references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016022 (https://phabricator.wikimedia.org/T109277) (owner: 10Jdlrobson) [20:17:55] (03Merged) 10jenkins-bot: Enable night mode on template namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019857 (owner: 10Jdlrobson) [20:19:10] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:19:13] (03PS9) 10Winston Sung: zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015551 (https://phabricator.wikimedia.org/T361427) (owner: 
10S8321414) [20:19:17] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:19:20] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:1016022|Enable desktop watchlist on beta cluster, clean up old references (T109277)]], [[gerrit:1019857|Enable night mode on template namespace]] [20:19:25] T109277: [EPIC]: Use core watchlist code for mobile experience - https://phabricator.wikimedia.org/T109277 [20:19:48] (03PS2) 10CDanis: Fix nuisance black diffs [software/conftool] - 10https://gerrit.wikimedia.org/r/1019882 [20:21:56] !log kindrobot@deploy1002 jdlrobson and kindrobot: Backport for [[gerrit:1016022|Enable desktop watchlist on beta cluster, clean up old references (T109277)]], [[gerrit:1019857|Enable night mode on template namespace]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:22:32] Jdlrobson: deployed on test servers; can you confirm changes? [20:22:50] kindrobot: looking [20:22:53] <3 [20:23:25] kindrobot: yep that's working! Please sync and thanks! 
[20:23:42] Great, syncing [20:24:06] !log kindrobot@deploy1002 jdlrobson and kindrobot: Continuing with sync [20:26:12] (PS3) CDanis: add python 3.11 [software/conftool] - https://gerrit.wikimedia.org/r/1019876 [20:29:42] (CR) CDanis: [C:+2] add python 3.11 [software/conftool] - https://gerrit.wikimedia.org/r/1019876 (owner: CDanis) [20:29:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T356166)', diff saved to https://phabricator.wikimedia.org/P60539 and previous config saved to /var/cache/conftool/dbconfig/20240415-202943-marostegui.json [20:29:45] (CR) CDanis: [C:+2] Fix nuisance black diffs [software/conftool] - https://gerrit.wikimedia.org/r/1019882 (owner: CDanis) [20:29:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [20:29:48] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [20:30:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [20:32:28] thanks kindrobot for your help today! [20:32:29] (PS2) MusikAnimal: [mediawikiwiki] enable CodeMirror V6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1019893 (https://phabricator.wikimedia.org/T357795) [20:32:39] .
[20:32:59] (PS1) Andrew Bogott: cinder/bobcat: remove volume_type_access hack [puppet] - https://gerrit.wikimedia.org/r/1019894 (https://phabricator.wikimedia.org/T356287) [20:32:59] (Merged) jenkins-bot: Fix nuisance black diffs [software/conftool] - https://gerrit.wikimedia.org/r/1019882 (owner: CDanis) [20:33:00] (Merged) jenkins-bot: add python 3.11 [software/conftool] - https://gerrit.wikimedia.org/r/1019876 (owner: CDanis) [20:33:00] (PS1) Andrew Bogott: bobcat cinder: remove backup scheduler hack [puppet] - https://gerrit.wikimedia.org/r/1019895 (https://phabricator.wikimedia.org/T356287) [20:33:02] (PS1) Andrew Bogott: cinder/bobcat: removing chunkeddriver.py.patch [puppet] - https://gerrit.wikimedia.org/r/1019896 (https://phabricator.wikimedia.org/T356287) [20:33:02] (PS3) JHathaway: Postfix profile [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [20:33:06] (PS1) Andrew Bogott: openstacksdk/bobcat: remove sdk hack about clouds.yaml load ordering [puppet] - https://gerrit.wikimedia.org/r/1019897 (https://phabricator.wikimedia.org/T356287) [20:33:10] (PS1) Andrew Bogott: neutron/bobcat: remove an l3 conf override hack [puppet] - https://gerrit.wikimedia.org/r/1019898 (https://phabricator.wikimedia.org/T356287) [20:34:14] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:34:18] (CR) CI reject: [V:-1] Postfix profile [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: JHathaway) [20:34:48] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:34:51] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9715498 (jcrespo) Thank you a lot, to everybody!
[20:34:55] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:35:04] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:35:30] ...anyone take a look at the zhwikivoyage config? [20:36:23] (04:35 am UTC+8) [20:36:26] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:1016022|Enable desktop watchlist on beta cluster, clean up old references (T109277)]], [[gerrit:1019857|Enable night mode on template namespace]] (duration: 17m 06s) [20:36:31] T109277: [EPIC]: Use core watchlist code for mobile experience - https://phabricator.wikimedia.org/T109277 [20:36:42] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:36:52] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:36:55] Deployed Jdlrobson, thanks for your service [20:37:04] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:37:06] 🎉 [20:37:09] Winston_Sung: yes, I'll be looking at yours next [20:37:11] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:37:46] (PS3) Andrew Bogott: New files/templates for OpenStack Bobcat (2023.2) [puppet] - https://gerrit.wikimedia.org/r/1019879 (https://phabricator.wikimedia.org/T356287) [20:37:46] (PS2) Andrew Bogott: cinder/bobcat: remove volume_type_access hack [puppet] - https://gerrit.wikimedia.org/r/1019894 (https://phabricator.wikimedia.org/T356287) [20:37:47] (PS2) Andrew Bogott: bobcat cinder: remove backup scheduler hack [puppet] - https://gerrit.wikimedia.org/r/1019895 (https://phabricator.wikimedia.org/T356287) [20:37:47] (PS2) Andrew Bogott: cinder/bobcat: removing chunkeddriver.py.patch [puppet] - https://gerrit.wikimedia.org/r/1019896 (https://phabricator.wikimedia.org/T356287) [20:37:48] (PS2)
Andrew Bogott: openstacksdk/bobcat: remove sdk hack about clouds.yaml load ordering [puppet] - https://gerrit.wikimedia.org/r/1019897 (https://phabricator.wikimedia.org/T356287) [20:37:49] (PS2) Andrew Bogott: neutron/bobcat: remove an l3 conf override hack [puppet] - https://gerrit.wikimedia.org/r/1019898 (https://phabricator.wikimedia.org/T356287) [20:38:02] @kindrobot: Thanks. [20:40:50] (CR) TrainBranchBot: [C:+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1015551 (https://phabricator.wikimedia.org/T361427) (owner: S8321414) [20:41:32] SRE, Data Pipelines, Data-Engineering, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9715528 (mpopov) >>! In T252227#9655162, @dr0ptp4kt wrote: > Okay, if I understand correctly, then the idea would be to... > > 1. Continue "allowing" tag... [20:41:46] (Merged) jenkins-bot: zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1015551 (https://phabricator.wikimedia.org/T361427) (owner: S8321414) [20:42:04] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:1015551|zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage (T361427)]] [20:42:09] T361427: Make RelatedArticles extension usable on zhwikivoyage - https://phabricator.wikimedia.org/T361427 [20:44:31] !log kindrobot@deploy1002 s8321414 and kindrobot: Backport for [[gerrit:1015551|zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage (T361427)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:44:57] Winston_Sung: on the test servers; can you confirm [20:45:22] Confirming...
[20:47:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:14] Confirmed worked. [20:48:21] Great! Syncing [20:48:24] !log kindrobot@deploy1002 s8321414 and kindrobot: Continuing with sync [20:50:10] Thanks. [21:00:05] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240415T2100). Please do the needful. [21:00:35] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:1015551|zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage (T361427)]] (duration: 18m 30s) [21:00:49] T361427: Make RelatedArticles extension usable on zhwikivoyage - https://phabricator.wikimedia.org/T361427 [21:01:00] Winston_Sung: deployment finished. Thank you for your service [21:01:12] !log closing the UTC late backport window [21:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:05] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:14:26] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:15:27] SRE: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#9715602 (ayounsi) Done :) [21:16:20] SRE: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#9715599 (ayounsi) [21:21:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:22:31] SRE, Infrastructure-Foundations, netops,
Patch-For-Review: Move management routers ssh port - https://phabricator.wikimedia.org/T277438#9715614 (ayounsi) We might have to re-prioritize this task because of {T362522} [21:27:11] (PS1) Ayounsi: Netbox validators: add magru [software/netbox-extras] - https://gerrit.wikimedia.org/r/1019927 (https://phabricator.wikimedia.org/T362421) [21:30:39] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:30:46] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:37:54] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:38:00] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:39:47] (CR) Ladsgroup: [C:+1] "Awesome. Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1019861 (https://phabricator.wikimedia.org/T173338) (owner: Herron) [21:44:11] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:44:16] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:45:45] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:45:50] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:47:17] SRE, Phabricator, Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#9715648 (CodeReviewBot) brennen updated https://gitlab.wikimedia.org/repos/phabricator/extensions/-/merge_requests/30 Remove...
[21:48:23] SRE, Phabricator, Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#9715653 (CodeReviewBot) brennen merged https://gitlab.wikimedia.org/repos/phabricator/extensions/-/merge_requests/30 Remove u... [21:48:24] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:48:30] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (16) wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:09:17] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 19 hosts with reason: T362508 [22:09:24] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508 [22:09:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 19 hosts with reason: T362508 [22:10:36] (PS4) JHathaway: Postfix profile [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [22:13:42] (CR) CI reject: [V:-1] Postfix profile [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: JHathaway) [22:44:28] !log
ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:44:35] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:50:27] (PS1) Ebernhardson: cirrus: Update container for saneitizer [deployment-charts] - https://gerrit.wikimedia.org/r/1019935 (https://phabricator.wikimedia.org/T358599) [22:52:46] (CR) Ebernhardson: [C:+2] cirrus: Update container for saneitizer [deployment-charts] - https://gerrit.wikimedia.org/r/1019935 (https://phabricator.wikimedia.org/T358599) (owner: Ebernhardson) [22:53:40] (Merged) jenkins-bot: cirrus: Update container for saneitizer [deployment-charts] - https://gerrit.wikimedia.org/r/1019935 (https://phabricator.wikimedia.org/T358599) (owner: Ebernhardson) [22:57:35] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:57:40] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:17:19] (PS1) Jdlrobson: Thumbnail styles generalized and moved to core [core] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1019910 (https://phabricator.wikimedia.org/T360388) [23:21:56] (PS1) Jdlrobson: English Wikipedia: Use WikimediaMessages for template overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1019941 (https://phabricator.wikimedia.org/T361589) [23:23:03] (PS2) Jdlrobson: Use WikimediaMessages for template overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1019941 (https://phabricator.wikimedia.org/T361589) [23:38:04] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019768 [23:38:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019768 (owner: TrainBranchBot) [23:58:24] (Merged)
jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019768 (owner: TrainBranchBot)