[00:26:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 848.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:31:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 889.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:55:49] (PuppetFailure) firing: Puppet has failed on mx-out2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:11:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:26] (RoutinatorRsyncErrors) resolved: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:34:55] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1017450 [01:34:55] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1017450 (owner: 10TrainBranchBot) [01:34:56] (03PS2) 10Tim Starling: WMCS: Add 
--quiet option to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1016912 [01:34:58] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1017450 (owner: 10TrainBranchBot) [01:35:12] 10ops-eqiad, 06SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033 (10ops-monitoring-bot) 03NEW [02:38:28] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:47] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate swift_codfw is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:41:58] (CertAlmostExpired) firing: (2) Certificate for service swift-https:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:50:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:55:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:58:28] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:20:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:38:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 897.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:43:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 897.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:01:34] !log Cleaning Prometheus and Thanos-BE log gzips older than 45 days on centrallog1002 [04:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:02:28] !log Cleaning Prometheus and Thanos-BE log gzips older than 45 days on centrallog2002 [04:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:32:03] (03PS1) 10KartikMistry: Enable the unified dashboard on the test instance for all languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017528 (https://phabricator.wikimedia.org/T360607) [04:32:43] (03CR) 10CI reject: [V:04-1] Enable the 
unified dashboard on the test instance for all languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017528 (https://phabricator.wikimedia.org/T360607) (owner: 10KartikMistry) [04:55:49] (PuppetFailure) firing: Puppet has failed on mx-out2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:11:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:15] (03PS2) 10KartikMistry: Enable the unified dashboard on the test instance for all languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017528 (https://phabricator.wikimedia.org/T360607) [05:57:22] (03CR) 10Samwilson: [C:03+1] Switch block schema to read-new/write-new mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006181 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [06:05:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1156', diff saved to https://phabricator.wikimedia.org/P59773 and previous config saved to /var/cache/conftool/dbconfig/20240408-060554-root.json [06:06:56] (03PS1) 10Marostegui: db1156: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1017535 (https://phabricator.wikimedia.org/T361543) [06:08:18] (03CR) 10Marostegui: [C:03+2] db1156: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1017535 (https://phabricator.wikimedia.org/T361543) (owner: 10Marostegui) [06:11:20] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1222 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1017451 (https://phabricator.wikimedia.org/T362036) [06:11:24] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 
10https://gerrit.wikimedia.org/r/1017452 (https://phabricator.wikimedia.org/T362036) [06:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:13:43] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1020:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:14:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59774 and previous config saved to /var/cache/conftool/dbconfig/20240408-061413-root.json [06:29:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59775 and previous config saved to /var/cache/conftool/dbconfig/20240408-062919-root.json [06:32:31] (Traffic bill over quota) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:37:31] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:38:47] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate swift_codfw is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:41:58] (CertAlmostExpired) firing: (2) Certificate for service swift-https:443 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:44:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59776 and previous config saved to /var/cache/conftool/dbconfig/20240408-064424-root.json [06:45:07] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for IF services (batch two) [puppet] - 10https://gerrit.wikimedia.org/r/1017269 (owner: 10Muehlenhoff) [06:45:27] (03PS1) 10Slyngshede: P:idp Disable Bullseye hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1017602 (https://phabricator.wikimedia.org/T357748) [06:52:19] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1017602 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [06:52:31] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:57:31] (Traffic bill over quota) resolved: (2) Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:59:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59777 and previous config saved to /var/cache/conftool/dbconfig/20240408-065931-root.json [07:00:04] Amir1 and Urbanecm: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240408T0700). [07:00:04] ihurbain and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:25] hello, world. [07:02:08] Sorry, a bit late. 
[07:05:28] (03PS4) 10NMW03: Restrict local uploads to uploader user group in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) [07:06:13] ihurbain: Deploying? [07:06:42] i haven't seen a deployer yet ^^; [07:08:21] ihurbain: I can deploy if you want.. [07:08:31] let's do this then [07:08:57] OK. Starting.. [07:09:01] thank you :) [07:11:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017268 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [07:11:10] (03CR) 10Slyngshede: [C:03+2] P:idp Disable Bullseye hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1017602 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [07:11:53] (03Merged) 10jenkins-bot: Add Kartographer Parsoid support to hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017268 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [07:12:29] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1017268|Add Kartographer Parsoid support to hewikivoyage (T342871 T361025)]] [07:12:33] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [07:12:33] T361025: Deploy parsoid read views for hebrew wikivoyage - https://phabricator.wikimedia.org/T361025 [07:14:19] (03CR) 10Muehlenhoff: [C:03+2] mx: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1017263 (owner: 10Muehlenhoff) [07:14:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59778 and previous config saved to /var/cache/conftool/dbconfig/20240408-071436-root.json [07:15:58] (03PS5) 10NMW03: Restrict local uploads to uploader user group in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) [07:19:59] 
(03CR) 10Muehlenhoff: [C:03+2] barbican: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1017261 (owner: 10Muehlenhoff) [07:21:43] (03CR) 10NMW03: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) (owner: 10NMW03) [07:25:26] !log kartik@deploy1002 kartik and ihurbain: Backport for [[gerrit:1017268|Add Kartographer Parsoid support to hewikivoyage (T342871 T361025)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:25:31] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [07:25:32] T361025: Deploy parsoid read views for hebrew wikivoyage - https://phabricator.wikimedia.org/T361025 [07:26:06] isaranto: can you test your config patch on the mwdebug server? [07:26:07] ah :) which mwdebug address should i use? [07:27:33] kart_: which patch are you referring to? [07:27:44] that was for me, sorry - i-tab :P [07:27:55] ok, no prob! :) [07:27:57] ah. Sorry isaranto :) [07:28:20] (03PS1) 10Muehlenhoff: puppetdb::microservice: Use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1017769 [07:28:25] ihurbain: You can use any one from eqiad. [07:28:29] ack [07:28:40] 10ops-codfw, 06SRE, 06DBA, 13Patch-For-Review: 14db2214 crashed - 14https://phabricator.wikimedia.org/T361851#9696278 (10ABran-WMF) 05In progress→03Resolved 14thanks @Jhancock.wm, I'll repool the server and we'll see how it goes [07:29:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017769 (owner: 10Muehlenhoff) [07:29:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59779 and previous config saved to /var/cache/conftool/dbconfig/20240408-072942-root.json [07:30:09] hrmm. [07:30:41] ah [07:31:03] let me render a few pages and we should be good [07:31:11] Sure. 
[07:31:31] (Traffic bill over quota) firing: (2) Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [07:31:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: provisionning db2214.codfw.wmnet - T355422 [07:31:58] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [07:32:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: provisionning db2214.codfw.wmnet - T355422 [07:32:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: provisionning db2214.codfw.wmnet - T355422 [07:32:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: provisionning db2214.codfw.wmnet - T355422 [07:32:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2114 in db2214 for T355422', diff saved to https://phabricator.wikimedia.org/P59780 and previous config saved to /var/cache/conftool/dbconfig/20240408-073239-arnaudb.json [07:34:03] (03CR) 10Urbanecm: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) (owner: 10NMW03) [07:34:42] kart_: ship it! :) [07:35:07] ihurbain: cool. 
[07:35:13] !log kartik@deploy1002 kartik and ihurbain: Continuing with sync [07:35:21] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2114.codfw.wmnet onto db2214.codfw.wmnet [07:36:31] (Traffic bill over quota) firing: (6) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [07:38:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 37 hosts with reason: Primary switchover s1 T361786 [07:38:51] T361786: Switchover s1 master (db2112 -> db2203) - https://phabricator.wikimedia.org/T361786 [07:39:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 37 hosts with reason: Primary switchover s1 T361786 [07:40:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2203 with weight 0 T361786', diff saved to https://phabricator.wikimedia.org/P59781 and previous config saved to /var/cache/conftool/dbconfig/20240408-074006-arnaudb.json [07:44:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59782 and previous config saved to /var/cache/conftool/dbconfig/20240408-074448-root.json [07:47:04] !log installing util-linux security updates on bullseye/bookworm [07:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:12] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1017268|Add Kartographer Parsoid support to hewikivoyage (T342871 T361025)]] (duration: 35m 43s) [07:48:16] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [07:48:16] T361025: Deploy parsoid read views for hebrew wikivoyage - https://phabricator.wikimedia.org/T361025 [07:48:26] ihurbain: Done! [07:48:37] kart_: thank you very much! 
[07:49:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017528 (https://phabricator.wikimedia.org/T360607) (owner: 10KartikMistry) [07:49:53] (03CR) 10JMeybohm: [C:03+2] Bump calico-kube-controllers memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017316 (https://phabricator.wikimedia.org/T361706) (owner: 10JMeybohm) [07:51:31] (Traffic bill over quota) firing: (6) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [07:52:56] (03Merged) 10jenkins-bot: Bump calico-kube-controllers memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017316 (https://phabricator.wikimedia.org/T361706) (owner: 10JMeybohm) [07:54:11] (03CR) 10Filippo Giunchedi: [C:03+2] jaeger: disable index cleaner [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015521 (https://phabricator.wikimedia.org/T344953) (owner: 10Filippo Giunchedi) [07:54:31] (03CR) 10Filippo Giunchedi: [C:03+2] logging: move jaeger index cleanup to curator [puppet] - 10https://gerrit.wikimedia.org/r/1015520 (https://phabricator.wikimedia.org/T344953) (owner: 10Filippo Giunchedi) [07:54:52] (03PS2) 10Jelto: Deprecate system::role for Collaboration services (batch one) [puppet] - 10https://gerrit.wikimedia.org/r/1017274 (owner: 10Muehlenhoff) [07:55:14] (03CR) 10Jelto: Deprecate system::role for Collaboration services (batch one) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017274 (owner: 10Muehlenhoff) [07:55:40] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [07:56:05] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[07:56:29] 10ops-codfw, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9696344 (10Volans) The host is alerting in Icinga, should it be downtimed? [07:56:31] (Traffic bill over quota) resolved: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [07:56:43] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [07:56:59] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [07:57:03] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [07:57:17] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [07:57:31] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1803/co" [puppet] - 10https://gerrit.wikimedia.org/r/1017274 (owner: 10Muehlenhoff) [07:57:46] (03CR) 10Filippo Giunchedi: [C:03+1] jaeger ui: two week lookback [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017314 (owner: 10CDanis) [08:01:02] (03PS3) 10KartikMistry: Enable the unified dashboard on the test instance for all languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017528 (https://phabricator.wikimedia.org/T360607) [08:01:47] (03CR) 10JMeybohm: [C:03+1] Update kubernetes' svc ipv6 ranges for AUX and DSE [puppet] - 10https://gerrit.wikimedia.org/r/1017311 (https://phabricator.wikimedia.org/T353705) (owner: 10Elukey) [08:01:51] (03CR) 10JMeybohm: [C:03+1] network::data: update all kubesvc's ipv6 ranges [puppet] - 10https://gerrit.wikimedia.org/r/1017312 (https://phabricator.wikimedia.org/T353705) (owner: 10Elukey) [08:03:24] (03CR) 10Muehlenhoff: [C:03+2] Tighten data type for profile::icinga::partners [puppet] - 10https://gerrit.wikimedia.org/r/1017265 
(owner: 10Muehlenhoff) [08:06:44] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1016378 (https://phabricator.wikimedia.org/T361786) (owner: 10Gerrit maintenance bot) [08:07:36] (03CR) 10TrainBranchBot: "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017528 (https://phabricator.wikimedia.org/T360607) (owner: 10KartikMistry) [08:08:21] (03Merged) 10jenkins-bot: Enable the unified dashboard on the test instance for all languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017528 (https://phabricator.wikimedia.org/T360607) (owner: 10KartikMistry) [08:08:36] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1017528|Enable the unified dashboard on the test instance for all languages (T360607)]] [08:08:38] !log Starting s1 codfw failover from db2112 to db2203 - T361786 [08:08:41] T360607: Enable the unified dashboard on the test instance for all languages - https://phabricator.wikimedia.org/T360607 [08:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:43] T361786: Switchover s1 master (db2112 -> db2203) - https://phabricator.wikimedia.org/T361786 [08:09:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2203 to s1 primary T361786', diff saved to https://phabricator.wikimedia.org/P59783 and previous config saved to /var/cache/conftool/dbconfig/20240408-080910-arnaudb.json [08:09:45] (03CR) 10Filippo Giunchedi: [C:03+1] "I believe this can move forward, what do you think Matthew?" 
[alerts] - 10https://gerrit.wikimedia.org/r/1010347 (owner: 10Tim Starling) [08:10:49] !log kartik@deploy1002 kartik: Backport for [[gerrit:1017528|Enable the unified dashboard on the test instance for all languages (T360607)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:11:13] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for Collaboration services (batch one) [puppet] - 10https://gerrit.wikimedia.org/r/1017274 (owner: 10Muehlenhoff) [08:12:12] !log restarted stashbot that had died few minutes ago [08:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:37] !log kartik@deploy1002 kartik: Continuing with sync [08:13:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Bump db2112 weight T361786', diff saved to https://phabricator.wikimedia.org/P59784 and previous config saved to /var/cache/conftool/dbconfig/20240408-081320-arnaudb.json [08:13:28] T361786: Switchover s1 master (db2112 -> db2203) - https://phabricator.wikimedia.org/T361786 [08:13:41] (03CR) 10Ayounsi: [C:03+1] installserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1017279 (owner: 10Muehlenhoff) [08:14:10] 10ops-codfw, 06SRE, 10decommission-hardware: decommission elastic20[37-54].codfw.wmnet - https://phabricator.wikimedia.org/T361305#9696389 (10dcausse) 05Resolved→03Open Reopening since it seems some of these hosts are still mentioned somewhere. The elastic settings check is complaining with `CRITICAL - [... [08:16:12] (03CR) 10Filippo Giunchedi: "TBH I don't have an opinion on this, though +1 on https://gerrit.wikimedia.org/r/c/operations/alerts/+/1010347" [alerts] - 10https://gerrit.wikimedia.org/r/1008590 (owner: 10Tim Starling) [08:18:46] (03CR) 10Btullis: [C:03+1] "Looks good. Thanks for investigating." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1017487 (https://phabricator.wikimedia.org/T361894) (owner: 10Brouberol) [08:19:34] (03CR) 10Brouberol: [C:03+2] superset: ensure role list returned by OIDC server is a list of strings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017487 (https://phabricator.wikimedia.org/T361894) (owner: 10Brouberol) [08:23:35] (03PS1) 10Filippo Giunchedi: aptrepo: update grafana in bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1017776 (https://phabricator.wikimedia.org/T361830) [08:23:42] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [08:23:45] (03CR) 10Volans: [C:03+1] "LGTM" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017358 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [08:24:10] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [08:24:18] (03CR) 10Volans: [C:03+1] "LGTM" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017346 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [08:24:23] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1017528|Enable the unified dashboard on the test instance for all languages (T360607)]] (duration: 15m 47s) [08:24:26] T360607: Enable the unified dashboard on the test instance for all languages - https://phabricator.wikimedia.org/T360607 [08:25:46] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [08:26:13] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [08:26:33] (03PS1) 10JMeybohm: calico: Bump typha memory, make calico memory guaranteed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017777 (https://phabricator.wikimedia.org/T361706) [08:27:49] (03CR) 10JMeybohm: [C:03+2] Update blubberoid chart to mesh.deployment:1.3.0 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1017259 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [08:27:50] (03CR) 10JMeybohm: [C:03+2] Update apertium chart to mesh.deployment:1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017258 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [08:28:19] (03CR) 10Majavah: [C:03+2] P:microsites: fix SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/1013968 (owner: 10Majavah) [08:28:56] (03Merged) 10jenkins-bot: Update blubberoid chart to mesh.deployment:1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017259 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [08:29:02] (03Merged) 10jenkins-bot: Update apertium chart to mesh.deployment:1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017258 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [08:29:21] !log restarting blazegraph on wdqs1020 (BlazegraphFreeAllocatorsDecreasingRapidly) [08:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1017776 (https://phabricator.wikimedia.org/T361830) (owner: 10Filippo Giunchedi) [08:30:57] (03CR) 10Filippo Giunchedi: [C:03+2] aptrepo: update grafana in bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1017776 (https://phabricator.wikimedia.org/T361830) (owner: 10Filippo Giunchedi) [08:33:04] (03CR) 10MVernon: [C:03+1] "Mmm, seems sensible to me. Thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1010347 (owner: 10Tim Starling) [08:34:19] (03CR) 10MVernon: [C:03+1] "I think this will be useful, thanks." 
[alerts] - 10https://gerrit.wikimedia.org/r/1008590 (owner: 10Tim Starling) [08:35:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db2214.codfw.wmnet [08:35:30] (03CR) 10Tim Starling: [C:03+2] SwiftTooManyMediaUploads: use subtraction instead of increase() [alerts] - 10https://gerrit.wikimedia.org/r/1008590 (owner: 10Tim Starling) [08:35:34] (03CR) 10Tim Starling: [C:03+2] SwiftTooManyMediaUploads: reduce severity [alerts] - 10https://gerrit.wikimedia.org/r/1010347 (owner: 10Tim Starling) [08:37:00] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [08:37:22] (03CR) 10Clément Goubert: [C:03+1] calico: Bump typha memory, make calico memory guaranteed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017777 (https://phabricator.wikimedia.org/T361706) (owner: 10JMeybohm) [08:37:30] (03Merged) 10jenkins-bot: SwiftTooManyMediaUploads: use subtraction instead of increase() [alerts] - 10https://gerrit.wikimedia.org/r/1008590 (owner: 10Tim Starling) [08:37:42] (03CR) 10JMeybohm: [C:03+2] calico: Bump typha memory, make calico memory guaranteed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017777 (https://phabricator.wikimedia.org/T361706) (owner: 10JMeybohm) [08:38:11] (03Merged) 10jenkins-bot: SwiftTooManyMediaUploads: reduce severity [alerts] - 10https://gerrit.wikimedia.org/r/1010347 (owner: 10Tim Starling) [08:38:43] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1020:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:39:06] jouncebot: next [08:39:06] In 1 hour(s) and 20 minute(s): 
MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240408T1000) [08:39:18] I'll upgrade grafana.w.o shortly [08:40:45] (03Merged) 10jenkins-bot: calico: Bump typha memory, make calico memory guaranteed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017777 (https://phabricator.wikimedia.org/T361706) (owner: 10JMeybohm) [08:41:10] !log grafana upgrade to 9.5.18 - T361830 [08:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::sanitarium_multiinstance [08:45:20] (03CR) 10Volans: "Thanks for the fixes, just run:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [08:46:27] (03PS1) 10Muehlenhoff: Switch mariadb::sanitarium_multiinstance to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017778 (https://phabricator.wikimedia.org/T349619) [08:55:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 25%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P59785 and previous config saved to /var/cache/conftool/dbconfig/20240408-085545-arnaudb.json [08:55:49] (PuppetFailure) firing: Puppet has failed on mx-out2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:56:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [08:57:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [08:57:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T360332)', diff saved to https://phabricator.wikimedia.org/P59786 and previous config saved to /var/cache/conftool/dbconfig/20240408-085708-arnaudb.json 
[08:57:11] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [08:57:20] 06SRE, 10Cloud-VPS, 10DNS, 06Traffic: DNS name resolution failure with www.spacecom.mil from Cloud VPS - https://phabricator.wikimedia.org/T346471#9696527 (10taavi) `www.spacecom.mil` seems to work now: `lang=shell-session taavi@tools-bastion-12:~ $ dig www.spacecom.mil ; <<>> DiG 9.18.24-1-Debian <<>> ww... [08:57:24] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2197 [puppet] - 10https://gerrit.wikimedia.org/r/1017779 (https://phabricator.wikimedia.org/T355422) [08:58:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:58:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:59:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T360332)', diff saved to https://phabricator.wikimedia.org/P59787 and previous config saved to /var/cache/conftool/dbconfig/20240408-085924-arnaudb.json [09:00:52] (03CR) 10Muehlenhoff: [C:03+2] Switch mariadb::sanitarium_multiinstance to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017778 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:02:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:02:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:02:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:02:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance 
[09:02:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T360332)', diff saved to https://phabricator.wikimedia.org/P59788 and previous config saved to /var/cache/conftool/dbconfig/20240408-090258-arnaudb.json [09:03:15] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:05:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T360332)', diff saved to https://phabricator.wikimedia.org/P59789 and previous config saved to /var/cache/conftool/dbconfig/20240408-090535-arnaudb.json [09:06:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::sanitarium_multiinstance [09:10:06] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9696567 (10MoritzMuehlenhoff) [09:10:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 50%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P59790 and previous config saved to /var/cache/conftool/dbconfig/20240408-091051-arnaudb.json [09:10:54] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 15830 [09:11:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59791 and previous config saved to /var/cache/conftool/dbconfig/20240408-091432-arnaudb.json [09:14:48] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15830 [09:16:10] !log jayme@deploy1002 helmfile [codfw] START 
helmfile.d/admin 'apply'. [09:16:42] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:16:57] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:17:30] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:17:58] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [09:19:43] ops-codfw, SRE, Infrastructure-Foundations, netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9696600 (ayounsi) Thanks. What I don't understand is that if they go through ZTP or manual basic setup, they will by definiti... [09:20:34] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [09:20:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P59792 and previous config saved to /var/cache/conftool/dbconfig/20240408-092045-arnaudb.json [09:21:06] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply [09:21:38] (PS1) Filippo Giunchedi: node-exporter: ignore run/credentials mountpoint [puppet] - https://gerrit.wikimedia.org/r/1017784 [09:22:20] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply [09:25:53] (CR) Ayounsi: "Nicely done !"
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1017064 (https://phabricator.wikimedia.org/T358096) (owner: 10Cathal Mooney) [09:25:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 75%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P59793 and previous config saved to /var/cache/conftool/dbconfig/20240408-092557-arnaudb.json [09:29:09] (03PS6) 10Elukey: role::aqs: deploy the PKI-enabled TLS bundle and use it on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) [09:29:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59794 and previous config saved to /var/cache/conftool/dbconfig/20240408-092939-arnaudb.json [09:32:52] (03PS1) 10Filippo Giunchedi: titan: trim 5m retention to 4y + 1w [puppet] - 10https://gerrit.wikimedia.org/r/1017806 (https://phabricator.wikimedia.org/T351927) [09:35:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P59795 and previous config saved to /var/cache/conftool/dbconfig/20240408-093552-arnaudb.json [09:37:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::proxy::master [09:39:02] (03PS1) 10Muehlenhoff: Switch mariadb::proxy::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017807 (https://phabricator.wikimedia.org/T349619) [09:41:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 100%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P59796 and previous config saved to /var/cache/conftool/dbconfig/20240408-094102-arnaudb.json [09:43:19] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply [09:43:55] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [09:44:01] !log jayme@deploy1002 helmfile [eqiad] START 
helmfile.d/services/apertium: apply [09:44:25] (03CR) 10Muehlenhoff: [C:03+2] Switch mariadb::proxy::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017807 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:44:30] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [09:44:37] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [09:44:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T360332)', diff saved to https://phabricator.wikimedia.org/P59797 and previous config saved to /var/cache/conftool/dbconfig/20240408-094447-arnaudb.json [09:44:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:44:50] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:45:01] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [09:45:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:45:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T360332)', diff saved to https://phabricator.wikimedia.org/P59798 and previous config saved to /var/cache/conftool/dbconfig/20240408-094510-arnaudb.json [09:45:18] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [09:45:41] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [09:46:04] !log jgiannelos@deploy1002 Started deploy [restbase/deploy@c4d19d7]: (no justification provided) [09:47:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T360332)', diff saved to https://phabricator.wikimedia.org/P59799 and previous config saved to /var/cache/conftool/dbconfig/20240408-094726-arnaudb.json [09:49:53] !log jgiannelos@deploy1002 Finished deploy 
[restbase/deploy@c4d19d7]: (no justification provided) (duration: 03m 49s) [09:51:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T360332)', diff saved to https://phabricator.wikimedia.org/P59800 and previous config saved to /var/cache/conftool/dbconfig/20240408-095100-arnaudb.json [09:51:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [09:51:03] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:51:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [09:51:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T360332)', diff saved to https://phabricator.wikimedia.org/P59801 and previous config saved to /var/cache/conftool/dbconfig/20240408-095123-arnaudb.json [09:51:55] (03CR) 10Btullis: [C:03+1] role::aqs: deploy the PKI-enabled TLS bundle and use it on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [09:53:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::proxy::master [09:53:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T360332)', diff saved to https://phabricator.wikimedia.org/P59802 and previous config saved to /var/cache/conftool/dbconfig/20240408-095359-arnaudb.json [09:55:23] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9696705 (10MoritzMuehlenhoff) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240408T1000) [10:02:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to 
https://phabricator.wikimedia.org/P59803 and previous config saved to /var/cache/conftool/dbconfig/20240408-100233-arnaudb.json [10:03:30] (ProbeDown) firing: (3) Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:07:56] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2128.codfw.wmnet [10:08:30] (ProbeDown) resolved: (6) Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:09:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P59804 and previous config saved to /var/cache/conftool/dbconfig/20240408-100906-arnaudb.json [10:09:21] (03PS1) 10Muehlenhoff: Switch db2128 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017810 (https://phabricator.wikimedia.org/T349619) [10:13:00] (03CR) 10Muehlenhoff: [C:03+2] Switch db2128 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017810 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:17:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P59805 and previous config saved to /var/cache/conftool/dbconfig/20240408-101741-arnaudb.json [10:18:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2128.codfw.wmnet [10:24:03] !log Starting MediaModeration scanning script (stopped over the weekend due to server instability) - https://wikitech.wikimedia.org/wiki/MediaModeration [10:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:15] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P59806 and previous config saved to /var/cache/conftool/dbconfig/20240408-102414-arnaudb.json [10:32:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T360332)', diff saved to https://phabricator.wikimedia.org/P59807 and previous config saved to /var/cache/conftool/dbconfig/20240408-103249-arnaudb.json [10:32:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [10:32:55] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:33:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [10:33:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T360332)', diff saved to https://phabricator.wikimedia.org/P59808 and previous config saved to /var/cache/conftool/dbconfig/20240408-103313-arnaudb.json [10:35:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T360332)', diff saved to https://phabricator.wikimedia.org/P59809 and previous config saved to /var/cache/conftool/dbconfig/20240408-103529-arnaudb.json [10:38:47] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate swift_codfw is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:39:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T360332)', diff saved to https://phabricator.wikimedia.org/P59810 and previous config saved to /var/cache/conftool/dbconfig/20240408-103922-arnaudb.json [10:39:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:39:26] T360332: Make 
the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:39:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:39:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T360332)', diff saved to https://phabricator.wikimedia.org/P59811 and previous config saved to /var/cache/conftool/dbconfig/20240408-103945-arnaudb.json [10:41:58] (CertAlmostExpired) firing: (2) Certificate for service swift-https:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:42:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T360332)', diff saved to https://phabricator.wikimedia.org/P59812 and previous config saved to /var/cache/conftool/dbconfig/20240408-104221-arnaudb.json [10:42:42] (03PS1) 10Hnowlan: mw-jobrunner: bump max_execution_time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017819 (https://phabricator.wikimedia.org/T358308) [10:44:51] (03CR) 10JMeybohm: [C:03+1] mw-jobrunner: bump max_execution_time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017819 (https://phabricator.wikimedia.org/T358308) (owner: 10Hnowlan) [10:49:05] !log installing postgresql-13 security updates [10:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59813 and previous config saved to /var/cache/conftool/dbconfig/20240408-105036-arnaudb.json [10:51:42] !log Starting scan on dewiki for MediaModeration to catch-up on monthly limits - https://wikitech.wikimedia.org/wiki/MediaModeration [10:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] 
Enable abusefilter block at bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: 10Yahya) [10:57:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P59814 and previous config saved to /var/cache/conftool/dbconfig/20240408-105729-arnaudb.json [11:01:57] (03CR) 10Hnowlan: [C:03+2] mw-jobrunner: bump max_execution_time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017819 (https://phabricator.wikimedia.org/T358308) (owner: 10Hnowlan) [11:02:27] (03CR) 10Clément Goubert: [C:03+1] "LGTM, just needs a rebase because of the weekly rebuild" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:02:55] (03Merged) 10jenkins-bot: mw-jobrunner: bump max_execution_time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017819 (https://phabricator.wikimedia.org/T358308) (owner: 10Hnowlan) [11:03:02] (03PS6) 10TChin: [WIP] Add datasets-config helm chart and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [11:03:53] !log started manual wikidata dump on snapshot1009 for T252396 [11:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:56] T252396: Split page-meta-history wikidata dump job across multiple hosts - https://phabricator.wikimedia.org/T252396 [11:05:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59815 and previous config saved to /var/cache/conftool/dbconfig/20240408-110545-arnaudb.json [11:09:25] (03PS1) 10EoghanGaffney: [gitlab] Add shell script to replace rsync bare commands [puppet] - 10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) [11:09:54] (03CR) 10CI reject: [V:04-1] [gitlab] Add 
shell script to replace rsync bare commands [puppet] - 10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [11:09:56] !log hnowlan@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [11:09:56] !log hnowlan@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [11:10:33] (03PS3) 10Jon Harald Søby: Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) [11:10:39] !log hnowlan@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:10:40] (03PS2) 10EoghanGaffney: [gitlab] Add shell script to replace rsync bare commands [puppet] - 10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) [11:11:09] (03CR) 10CI reject: [V:04-1] [gitlab] Add shell script to replace rsync bare commands [puppet] - 10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [11:11:38] !log hnowlan@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:12:05] !log hnowlan@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [11:12:05] !log hnowlan@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [11:12:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P59816 and previous config saved to /var/cache/conftool/dbconfig/20240408-111236-arnaudb.json [11:12:54] !log hnowlan@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:13:52] !log hnowlan@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:14:46] (03PS3) 10EoghanGaffney: [gitlab] Add shell script to replace rsync bare commands [puppet] - 
10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) [11:15:16] (03CR) 10CI reject: [V:04-1] [gitlab] Add shell script to replace rsync bare commands [puppet] - 10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [11:16:51] (03PS4) 10EoghanGaffney: [gitlab] Add shell script to replace rsync bare commands [puppet] - 10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) [11:20:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T360332)', diff saved to https://phabricator.wikimedia.org/P59817 and previous config saved to /var/cache/conftool/dbconfig/20240408-112052-arnaudb.json [11:20:58] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [11:21:25] (SystemdUnitFailed) firing: netbox_report_cables_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T360332)', diff saved to https://phabricator.wikimedia.org/P59818 and previous config saved to /var/cache/conftool/dbconfig/20240408-112744-arnaudb.json [11:27:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [11:27:48] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [11:28:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [11:28:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T360332)', diff saved to https://phabricator.wikimedia.org/P59819 and previous config saved to 
/var/cache/conftool/dbconfig/20240408-112807-arnaudb.json [11:30:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T360332)', diff saved to https://phabricator.wikimedia.org/P59820 and previous config saved to /var/cache/conftool/dbconfig/20240408-113045-arnaudb.json [11:31:25] (SystemdUnitFailed) resolved: netbox_report_cables_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:35:58] !log installing glibc security updates on bullseye [11:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1029.eqiad.wmnet [11:43:29] (03PS1) 10Muehlenhoff: Switch es1029 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017826 (https://phabricator.wikimedia.org/T349619) [11:44:31] (03CR) 10Muehlenhoff: [C:03+2] Switch es1029 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017826 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:45:40] (03CR) 10Arnaudb: [C:03+2] mysqld-exporter-config: simplify manual runs [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb) [11:45:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P59821 and previous config saved to /var/cache/conftool/dbconfig/20240408-114552-arnaudb.json [11:49:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1029.eqiad.wmnet [11:49:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1030.eqiad.wmnet [11:50:41] (03PS1) 10Muehlenhoff: Switch es1030 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017827 (https://phabricator.wikimedia.org/T349619) [11:53:24] 
(03CR) 10Muehlenhoff: [C:03+2] Switch es1030 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017827 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:53:28] (03CR) 10Urbanecm: [C:04-1] "idea looks good to me, but see inline" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) (owner: 10Jon Harald Søby) [11:56:37] (03CR) 10JMeybohm: [C:03+2] Remove flink RBAC snowflakes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015343 (https://phabricator.wikimedia.org/T326409) (owner: 10JMeybohm) [11:56:40] (03CR) 10JMeybohm: [C:03+2] admin/namespaces: Remove net.beta.kubernetes.io/network-policy annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015329 (owner: 10JMeybohm) [11:57:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1030.eqiad.wmnet [11:57:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1031.eqiad.wmnet [11:58:55] (03PS3) 10Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) [11:59:45] (03Merged) 10jenkins-bot: admin/namespaces: Remove net.beta.kubernetes.io/network-policy annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015329 (owner: 10JMeybohm) [11:59:47] (03Merged) 10jenkins-bot: Remove flink RBAC snowflakes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015343 (https://phabricator.wikimedia.org/T326409) (owner: 10JMeybohm) [12:00:09] (03PS1) 10Muehlenhoff: Switch es1031 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017829 (https://phabricator.wikimedia.org/T349619) [12:01:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P59822 and previous config saved to /var/cache/conftool/dbconfig/20240408-120101-arnaudb.json [12:01:30] !log 
jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:02:00] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host stat1011.eqiad.wmnet [12:02:43] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host stat1011.eqiad.wmnet [12:02:59] (03CR) 10Muehlenhoff: [C:03+2] Switch es1031 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017829 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:03:13] (03CR) 10CI reject: [V:04-1] Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [12:04:16] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:05:59] (03PS1) 10Arnaudb: mariadb: hotfix mysqld-exporter-config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1017456 (https://phabricator.wikimedia.org/T327384) [12:06:13] (03PS4) 10S8321414: zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015551 (https://phabricator.wikimedia.org/T361427) [12:07:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1031.eqiad.wmnet [12:08:40] (03CR) 10Kormat: [C:03+1] mariadb: hotfix mysqld-exporter-config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1017456 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb) [12:08:46] 06SRE, 10Maps: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9697097 (10MoritzMuehlenhoff) [12:09:23] (03CR) 10Arnaudb: [C:03+2] mariadb: hotfix mysqld-exporter-config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1017456 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb) [12:09:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1022.eqiad.wmnet [12:10:52] (03PS1) 10Muehlenhoff: Switch es1022 to Puppet 7 [puppet] - 
https://gerrit.wikimedia.org/r/1017833 (https://phabricator.wikimedia.org/T349619) [12:11:38] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:12:17] SRE: Phase out cergen for Fundraising services - https://phabricator.wikimedia.org/T360779#9697100 (MoritzMuehlenhoff) [12:12:24] (CR) Btullis: Create a new aqs-http-gateway chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [12:12:30] (CR) Muehlenhoff: [C:+2] Switch es1022 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1017833 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [12:13:13] (CR) Dreamy Jazz: [C:+1] IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath [mediawiki-config] - https://gerrit.wikimedia.org/r/1017152 (https://phabricator.wikimedia.org/T361884) (owner: Tchanders) [12:14:20] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:15:04] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:16:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T360332)', diff saved to https://phabricator.wikimedia.org/P59823 and previous config saved to /var/cache/conftool/dbconfig/20240408-121609-arnaudb.json [12:16:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [12:16:12] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [12:16:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [12:16:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T360332)', diff saved to https://phabricator.wikimedia.org/P59824 and previous config saved to /var/cache/conftool/dbconfig/20240408-121642-arnaudb.json [12:17:45] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:17:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1022.eqiad.wmnet [12:17:58] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1237.eqiad.wmnet [12:19:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T360332)', diff saved to https://phabricator.wikimedia.org/P59825 and previous config saved to /var/cache/conftool/dbconfig/20240408-121920-arnaudb.json [12:19:56] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[12:20:09] (03PS1) 10Muehlenhoff: Switch db1237 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017839 (https://phabricator.wikimedia.org/T349619) [12:21:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:22:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:22:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T360332)', diff saved to https://phabricator.wikimedia.org/P59826 and previous config saved to /var/cache/conftool/dbconfig/20240408-122209-arnaudb.json [12:22:14] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [12:22:43] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:23:56] (03CR) 10Muehlenhoff: [C:03+2] Switch db1237 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017839 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:25:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T360332)', diff saved to https://phabricator.wikimedia.org/P59827 and previous config saved to /var/cache/conftool/dbconfig/20240408-122527-arnaudb.json [12:26:24] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1017784 (owner: 10Filippo Giunchedi) [12:27:04] (03CR) 10DCausse: "the Wikidata extension should be deployed and this patch should be ready to go" [puppet] - 10https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [12:27:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1237.eqiad.wmnet [12:28:19] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[12:28:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1226.eqiad.wmnet [12:29:57] (03PS1) 10Muehlenhoff: Switch db1226 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017843 (https://phabricator.wikimedia.org/T349619) [12:30:12] (03PS2) 10Muehlenhoff: Switch db1226 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017843 (https://phabricator.wikimedia.org/T349619) [12:31:01] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:31:32] (03PS1) 10JMeybohm: Revert "Remove flink RBAC snowflakes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017789 [12:31:50] (03PS2) 10JMeybohm: Revert "Remove flink RBAC snowflakes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017789 [12:33:05] !log reprepro --component thirdparty/haproxy26 update bullseye-wikimedia [Fetch HAProxy 2.6.17] - T362063 [12:33:07] (03PS3) 10JMeybohm: Revert "Remove flink RBAC snowflakes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017789 (https://phabricator.wikimedia.org/T326409) [12:33:34] (03CR) 10JMeybohm: [V:03+2 C:03+2] Revert "Remove flink RBAC snowflakes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017789 (https://phabricator.wikimedia.org/T326409) (owner: 10JMeybohm) [12:34:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P59828 and previous config saved to /var/cache/conftool/dbconfig/20240408-123427-arnaudb.json [12:34:47] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[12:35:27] (03PS1) 10Arnaudb: mariadb: prepare future new candidate master for s1 [puppet] - 10https://gerrit.wikimedia.org/r/1017457 (https://phabricator.wikimedia.org/T355422) [12:36:06] (03CR) 10Muehlenhoff: [C:03+2] Switch db1226 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017843 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:37:39] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:39:03] (03CR) 10Kormat: [C:03+2] mariadb: prepare future new candidate master for s1 [puppet] - 10https://gerrit.wikimedia.org/r/1017457 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [12:40:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1226.eqiad.wmnet [12:40:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P59829 and previous config saved to /var/cache/conftool/dbconfig/20240408-124035-arnaudb.json [12:40:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1194.eqiad.wmnet [12:41:53] (03PS1) 10Muehlenhoff: Switch db1194 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017845 (https://phabricator.wikimedia.org/T349619) [12:42:51] (03CR) 10Muehlenhoff: [C:03+2] Switch db1194 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017845 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:44:44] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[12:45:23] (03CR) 10Elukey: [C:03+2] Update kubernetes' svc ipv6 ranges for AUX and DSE [puppet] - 10https://gerrit.wikimedia.org/r/1017311 (https://phabricator.wikimedia.org/T353705) (owner: 10Elukey) [12:45:31] (03CR) 10Elukey: [C:03+2] network::data: update all kubesvc's ipv6 ranges [puppet] - 10https://gerrit.wikimedia.org/r/1017312 (https://phabricator.wikimedia.org/T353705) (owner: 10Elukey) [12:46:10] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:46:12] (03CR) 10Elukey: [C:03+2] role::aqs: deploy the PKI-enabled TLS bundle and use it on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [12:46:29] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:46:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1194.eqiad.wmnet [12:46:53] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1180.eqiad.wmnet [12:47:14] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[12:47:52] (03PS1) 10Muehlenhoff: Switch db1180 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017847 (https://phabricator.wikimedia.org/T349619) [12:48:24] (03PS1) 10Peter Fischer: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017848 (https://phabricator.wikimedia.org/T361900) [12:49:16] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017848 (https://phabricator.wikimedia.org/T361900) (owner: 10Peter Fischer) [12:49:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P59830 and previous config saved to /var/cache/conftool/dbconfig/20240408-124935-arnaudb.json [12:49:48] (03CR) 10Muehlenhoff: [C:03+2] Switch db1180 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017847 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:50:14] (03Merged) 10jenkins-bot: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017848 (https://phabricator.wikimedia.org/T361900) (owner: 10Peter Fischer) [12:51:40] (03PS1) 10Majavah: P:wmcs::metricsinfra::alertmanager: rename project_proxy as api_rw_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1017850 (https://phabricator.wikimedia.org/T362061) [12:51:42] (03PS1) 10Majavah: P:wmcs::metricsinfra::haproxy: drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1017851 [12:51:42] (03PS1) 10Majavah: P:wmcs::metricsinfra::haproxy: move domains to configuration [puppet] - 10https://gerrit.wikimedia.org/r/1017852 [12:51:42] (03PS1) 10Majavah: P:wmcs::metricsinfra::haproxy: add proxy to alertmanager rw endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1017853 (https://phabricator.wikimedia.org/T362061) [12:51:44] (03PS1) 10Majavah: P:wmcs::metricsinfra::alertmanager: add basic auth support [puppet] - 10https://gerrit.wikimedia.org/r/1017854 
(https://phabricator.wikimedia.org/T362061) [12:52:44] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1804/console" [puppet] - 10https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) (owner: 10Majavah) [12:53:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1180.eqiad.wmnet [12:55:04] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on aqs1010.eqiad.wmnet with reason: Replace Java Truststore [12:55:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aqs1010.eqiad.wmnet with reason: Replace Java Truststore [12:55:25] (SystemdUnitFailed) firing: ferm.service on ml-serve2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P59831 and previous config saved to /var/cache/conftool/dbconfig/20240408-125543-arnaudb.json [12:55:49] (PuppetFailure) firing: Puppet has failed on mx-out2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:56:25] (SystemdUnitFailed) firing: (3) ferm.service on mw1485:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:56:29] !log nodetool-a drain + restart of cassandra instances on aqs1010 to pick up the new truststore [12:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:15] (MediaWikiHighErrorRate) firing: 
Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240408T1300). [13:00:05] Yahya: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:29] :) [13:04:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T360332)', diff saved to https://phabricator.wikimedia.org/P59832 and previous config saved to /var/cache/conftool/dbconfig/20240408-130443-arnaudb.json [13:04:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [13:04:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [13:04:48] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:05:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [13:05:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [13:05:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T360332)', diff saved to https://phabricator.wikimedia.org/P59833 and previous config saved to /var/cache/conftool/dbconfig/20240408-130543-arnaudb.json [13:05:46] !log jmm@cumin2002 START 
- Cookbook sre.puppet.migrate-host for host db1200.eqiad.wmnet [13:06:25] (SystemdUnitFailed) firing: (4) ferm.service on kubernetes1029:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:15] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:07:26] (03PS1) 10Btullis: Prepare dumpsdata100[1-2] for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1017855 (https://phabricator.wikimedia.org/T353787) [13:07:43] (03PS1) 10Muehlenhoff: Switch db1200 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017856 (https://phabricator.wikimedia.org/T349619) [13:08:38] (03CR) 10Muehlenhoff: [C:03+2] Switch db1200 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017856 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:08:40] (03PS1) 10Majavah: P:puuppetserver: export-nodb: fix to use expected directory structure [puppet] - 10https://gerrit.wikimedia.org/r/1017857 [13:09:09] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy falcon7b-instruct [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017858 (https://phabricator.wikimedia.org/T354870) [13:09:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:09:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T360332)', diff saved to https://phabricator.wikimedia.org/P59834 and previous config saved to 
/var/cache/conftool/dbconfig/20240408-130921-arnaudb.json [13:09:57] o/O [13:10:00] * o/ [13:10:07] I can deploy [13:10:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T360332)', diff saved to https://phabricator.wikimedia.org/P59835 and previous config saved to /var/cache/conftool/dbconfig/20240408-131051-arnaudb.json [13:10:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [13:10:54] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:11:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [13:11:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T360332)', diff saved to https://phabricator.wikimedia.org/P59836 and previous config saved to /var/cache/conftool/dbconfig/20240408-131113-arnaudb.json [13:11:55] (03CR) 10Andrew Bogott: [C:03+1] P:puuppetserver: export-nodb: fix to use expected directory structure [puppet] - 10https://gerrit.wikimedia.org/r/1017857 (owner: 10Majavah) [13:11:55] (03CR) 10Btullis: [C:03+1] Remove obsolete stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1016312 (https://phabricator.wikimedia.org/T360412) (owner: 10Muehlenhoff) [13:12:04] (03CR) 10Majavah: [C:03+2] P:puuppetserver: export-nodb: fix to use expected directory structure [puppet] - 10https://gerrit.wikimedia.org/r/1017857 (owner: 10Majavah) [13:12:15] (03PS5) 10Lucas Werkmeister (WMDE): Enable abusefilter block at bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: 10Yahya) [13:12:19] (03CR) 10Btullis: [C:03+1] Remove now obsolete certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016313 (https://phabricator.wikimedia.org/T360412) (owner: 10Muehlenhoff) [13:12:37] (03CR) 10Btullis: [C:03+1] schema: Remove obsolete 
certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016315 (https://phabricator.wikimedia.org/T360412) (owner: 10Muehlenhoff) [13:12:46] (03CR) 10Btullis: [C:03+1] schema: Remove dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1016316 (https://phabricator.wikimedia.org/T360412) (owner: 10Muehlenhoff) [13:12:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1200.eqiad.wmnet [13:13:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: 10Yahya) [13:13:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T360332)', diff saved to https://phabricator.wikimedia.org/P59837 and previous config saved to /var/cache/conftool/dbconfig/20240408-131331-arnaudb.json [13:13:59] (03Merged) 10jenkins-bot: Enable abusefilter block at bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: 10Yahya) [13:14:15] (MediaWikiHighErrorRate) resolved: (3) Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:14:17] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1016882|Enable abusefilter block at bnwiki (T361852)]] [13:14:20] T361852: Enable AbuseFilter 'block' on Bengali Wikipedia - https://phabricator.wikimedia.org/T361852 [13:14:27] (03CR) 10David Caro: P:puuppetserver: export-nodb: fix to use expected directory structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017857 (owner: 10Majavah) [13:14:49] (03CR) 10David Caro: P:puuppetserver: export-nodb: fix to use expected directory structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017857 (owner: 10Majavah) [13:14:50] !log 
pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:15:25] (SystemdUnitFailed) firing: (2) ferm.service on ml-serve2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:33] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:15:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1241.eqiad.wmnet [13:16:04] (03CR) 10Majavah: [C:03+2] P:puuppetserver: export-nodb: fix to use expected directory structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017857 (owner: 10Majavah) [13:16:25] (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:26] (03PS1) 10Majavah: hieradata: add fake metricsinfra irc credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1017859 [13:16:50] (03PS1) 10Muehlenhoff: Switch 1241 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017860 (https://phabricator.wikimedia.org/T349619) [13:17:06] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9697413 (10Papaul) @ayounsi yes you are right since it will have an IP address it will be managed so I was thinking over it. Di... 
[13:17:12] (03CR) 10Majavah: [V:03+2 C:03+2] hieradata: add fake metricsinfra irc credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1017859 (owner: 10Majavah) [13:17:44] (03CR) 10Muehlenhoff: [C:03+2] Switch 1241 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017860 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:17:46] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and yahya: Backport for [[gerrit:1016882|Enable abusefilter block at bnwiki (T361852)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:18:00] Yahya: can you test the change on WikimediaDebug? [13:18:06] (see https://wikitech.wikimedia.org/wiki/WikimediaDebug) [13:18:07] (03PS2) 10Btullis: Prepare dumpsdata100[1-2] for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1017855 (https://phabricator.wikimedia.org/T353787) [13:18:32] I don’t think I can test it myself without sysop rights [13:21:21] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1807/co" [puppet] - 10https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) (owner: 10Majavah) [13:21:25] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1241.eqiad.wmnet [13:21:51] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1175.eqiad.wmnet [13:22:50] (03PS1) 10Muehlenhoff: Switch db1175 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017861 (https://phabricator.wikimedia.org/T349619) [13:22:52] (03PS4) 10Btullis: Create a new aqs-http-gateway chart 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) [13:22:52] (03PS3) 10Btullis: Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) [13:22:52] (03PS3) 10Btullis: Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) [13:22:53] (03PS3) 10Btullis: Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) [13:22:54] (03PS3) 10Btullis: Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531) [13:22:58] (03PS3) 10Btullis: Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) [13:23:02] (03PS3) 10Btullis: Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) [13:23:06] (03PS3) 10Btullis: Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531) [13:23:10] (03PS4) 10Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) [13:23:13] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[4037,4041,4045,4049].ulsfo.wmnet} and A:cp [13:23:18] (03CR) 10CI reject: [V:04-1] Create a new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 
(https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:23:41] (03CR) 10Muehlenhoff: [C:03+2] Switch db1175 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017861 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:23:46] (03CR) 10CI reject: [V:04-1] Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:24:03] (03CR) 10CI reject: [V:04-1] Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:24:04] (03CR) 10CI reject: [V:04-1] Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:24:09] (03CR) 10CI reject: [V:04-1] Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:24:26] (03CR) 10CI reject: [V:04-1] Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:24:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P59838 and previous config saved to /var/cache/conftool/dbconfig/20240408-132429-arnaudb.json [13:24:35] (03CR) 10CI reject: [V:04-1] Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:24:41] (03CR) 10Marostegui: mariadb: prepare future new candidate master for s1 (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1017457 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [13:24:43] (03CR) 10CI reject: [V:04-1] Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:24:56] (03CR) 10CI reject: [V:04-1] Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:25:19] (03CR) 10Btullis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:25:25] (SystemdUnitFailed) firing: (2) ferm.service on ml-serve2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:26:25] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:26:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2212.codfw.wmnet with reason: Silence for clone [13:26:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2212.codfw.wmnet with reason: Silence for clone [13:27:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1175.eqiad.wmnet [13:28:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1229.eqiad.wmnet [13:28:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P59839 and previous config saved to 
/var/cache/conftool/dbconfig/20240408-132838-arnaudb.json [13:29:31] (03PS1) 10Muehlenhoff: Switch db1229 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017864 (https://phabricator.wikimedia.org/T349619) [13:30:49] (03CR) 10Muehlenhoff: [C:03+2] Switch db1229 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017864 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:31:25] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:32:05] Lucas_WMDE Everything looks fine [13:32:15] Sorry for the delay [13:32:31] alright, thanks! [13:32:33] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and yahya: Continuing with sync [13:33:05] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[4037,4041,4045,4049].ulsfo.wmnet} and A:cp [13:34:03] (03PS5) 10Btullis: Create a new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) [13:34:03] (03PS4) 10Btullis: Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) [13:34:03] (03PS4) 10Btullis: Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) [13:34:04] (03PS4) 10Btullis: Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) [13:34:05] (03PS4) 10Btullis: Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531) [13:34:06] (03PS4) 10Btullis: Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) [13:34:10] (03PS4) 10Btullis: Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) [13:34:14] (03PS4) 10Btullis: Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531) [13:34:18] (03PS5) 10Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) [13:34:47] (03CR) 10CI reject: [V:04-1] Create a new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:34:53] (03CR) 10CI reject: [V:04-1] Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:34:54] (03CR) 10CI reject: [V:04-1] Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:35:00] (03CR) 10CI reject: [V:04-1] Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:35:21] (03CR) 10CI reject: [V:04-1] Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) 
[13:35:26] (03CR) 10CI reject: [V:04-1] Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:35:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1229.eqiad.wmnet [13:35:30] (03CR) 10CI reject: [V:04-1] Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:35:52] (03CR) 10CI reject: [V:04-1] Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:36:03] (03CR) 10CI reject: [V:04-1] Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:36:25] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:07] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1207.eqiad.wmnet [13:38:02] (03PS1) 10Muehlenhoff: Switch db1207 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017866 (https://phabricator.wikimedia.org/T349619) [13:39:00] (03CR) 10Muehlenhoff: [C:03+2] Switch db1207 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017866 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:39:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P59840 and previous config saved to 
/var/cache/conftool/dbconfig/20240408-133936-arnaudb.json [13:39:41] about “how to get progress out of sync-prod-k8s” which was discussed a few days ago [13:39:52] if I run `kube-env mw-web eqiad`, it looks like `kubectl get deployments` gives some reasonably useful information [13:40:25] (SystemdUnitFailed) resolved: (2) ferm.service on ml-serve2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:26] e.g. earlier mw-web.eqiad.main had READY 205/211, UP-TO-DATE 196, AVAILABLE 207 [13:40:46] which I assume means that 196 out of 211 pods(?) had the latest version of the code [13:41:06] (and 205 were accepting travel in total, 9 of them still with the old code) [13:41:11] s/travel/traffic/ lol [13:41:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:42:19] (cc Dreamy_Jazz who was interested in that) [13:43:04] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1016882|Enable abusefilter block at bnwiki (T361852)]] (duration: 28m 46s) [13:43:12] T361852: Enable AbuseFilter 'block' on Bengali Wikipedia - https://phabricator.wikimedia.org/T361852 [13:43:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P59841 and previous config saved to /var/cache/conftool/dbconfig/20240408-134345-arnaudb.json [13:43:52] Yahya: should be deployed everywhere now [13:44:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1207.eqiad.wmnet [13:45:14] !log UTC afternoon 
backport+config window done [13:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:31] and let me make a phab task for improved progress output in k8s [13:45:44] thanks! [13:45:50] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra::alertmanager: rename project_proxy as api_rw_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1017850 (https://phabricator.wikimedia.org/T362061) (owner: 10Majavah) [13:45:59] aha, https://phabricator.wikimedia.org/T361747 already exists :) [13:45:59] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra::haproxy: drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1017851 (owner: 10Majavah) [13:46:00] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] schema: Remove dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1016316 (https://phabricator.wikimedia.org/T360412) (owner: 10Muehlenhoff) [13:46:07] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra::haproxy: move domains to configuration [puppet] - 10https://gerrit.wikimedia.org/r/1017852 (owner: 10Majavah) [13:46:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:46:25] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:37] (03PS2) 10Majavah: P:wmcs::metricsinfra::haproxy: move domains to configuration [puppet] - 10https://gerrit.wikimedia.org/r/1017852 [13:46:37] (03PS2) 10Majavah: P:wmcs::metricsinfra::haproxy: add proxy to alertmanager rw endpoint [puppet] - 
10https://gerrit.wikimedia.org/r/1017853 (https://phabricator.wikimedia.org/T362061) [13:46:37] (03PS2) 10Majavah: P:wmcs::metricsinfra::alertmanager: add basic auth support [puppet] - 10https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) [13:46:41] (03CR) 10Herron: [C:03+1] titan: trim 5m retention to 4y + 1w [puppet] - 10https://gerrit.wikimedia.org/r/1017806 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [13:52:34] (03PS1) 10Elukey: role::aqs: rollout new Cassandra Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1017868 (https://phabricator.wikimedia.org/T352647) [13:53:30] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1017852 (owner: 10Majavah) [13:54:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T360332)', diff saved to https://phabricator.wikimedia.org/P59842 and previous config saved to /var/cache/conftool/dbconfig/20240408-135444-arnaudb.json [13:54:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:54:55] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1017868 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:54:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:55:00] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:55:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [13:55:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [13:56:15] !log 
arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [13:56:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [13:56:29] (03CR) 10Elukey: [C:03+1] ml-services: deploy falcon7b-instruct [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017858 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [13:57:10] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra::haproxy: move domains to configuration [puppet] - 10https://gerrit.wikimedia.org/r/1017852 (owner: 10Majavah) [13:57:18] (03CR) 10Eevans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1017868 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:57:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [13:57:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [13:58:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T360332)', diff saved to https://phabricator.wikimedia.org/P59843 and previous config saved to /var/cache/conftool/dbconfig/20240408-135852-arnaudb.json [13:58:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [13:58:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [13:59:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [13:59:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [13:59:13] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: deploy falcon7b-instruct [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1017858 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [13:59:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T360332)', diff saved to https://phabricator.wikimedia.org/P59844 and previous config saved to /var/cache/conftool/dbconfig/20240408-135915-arnaudb.json [13:59:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2113 (T360332)', diff saved to https://phabricator.wikimedia.org/P59845 and previous config saved to /var/cache/conftool/dbconfig/20240408-135926-arnaudb.json [13:59:58] (03CR) 10Jcrespo: "FYI (for DBAs): These (s2, s6, x1) were recovered using backups taken April 4th's data." [puppet] - 10https://gerrit.wikimedia.org/r/1017779 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [14:00:09] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:00:27] (03Merged) 10jenkins-bot: ml-services: deploy falcon7b-instruct [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017858 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [14:01:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T360332)', diff saved to https://phabricator.wikimedia.org/P59846 and previous config saved to /var/cache/conftool/dbconfig/20240408-140132-arnaudb.json [14:02:09] (03PS1) 10Arnaudb: mariadb: revert profile::monitoring::notifications_enabled: false [puppet] - 10https://gerrit.wikimedia.org/r/1017458 (https://phabricator.wikimedia.org/T355422) [14:02:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2113 (T360332)', diff saved to https://phabricator.wikimedia.org/P59847 and previous config saved to /var/cache/conftool/dbconfig/20240408-140246-arnaudb.json [14:04:10] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:04:37] (03PS2) 10Elukey: ml-services: update RR ML/Wikidata's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017292 (https://phabricator.wikimedia.org/T360111) [14:05:21] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update RR ML/Wikidata's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017292 (https://phabricator.wikimedia.org/T360111) (owner: 10Elukey) [14:05:42] (03CR) 10Elukey: [C:03+2] ml-services: update RR ML/Wikidata's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017292 (https://phabricator.wikimedia.org/T360111) (owner: 10Elukey) [14:05:48] (03CR) 10Arnaudb: mariadb: prepare future new candidate master for s1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017457 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [14:06:47] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9697644 (10mforns) Hi @Scott_French ! > - When do you anticipate having a minimal binary that succe... [14:06:52] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:12:33] (03CR) 10Arnaudb: [C:03+2] mariadb: revert profile::monitoring::notifications_enabled: false [puppet] - 10https://gerrit.wikimedia.org/r/1017458 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [14:13:31] (03PS1) 10Herron: thanos-rule: stop overriding site and prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/1017873 (https://phabricator.wikimedia.org/T359879) [14:16:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P59849 and previous config saved to /var/cache/conftool/dbconfig/20240408-141641-arnaudb.json [14:17:32] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:17:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2113', diff saved to https://phabricator.wikimedia.org/P59850 and previous config saved to /var/cache/conftool/dbconfig/20240408-141753-arnaudb.json [14:18:04] (03CR) 10Brouberol: [C:03+2] mpic: scaffold chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017034 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [14:18:05] 06SRE, 10Maps, 06serviceops: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9697722 (10LSobanski) [14:19:17] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[14:19:47] !log depool cp3069 to prepare for reimaging: T360430 [14:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:50] T360430: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430 [14:20:16] (03CR) 10Elukey: [C:03+1] "LGTM, I may need to fix some grizzly dashboards, but the approach looks sound (although my knowledge of Thanos is minimal so I'd wait for " [puppet] - 10https://gerrit.wikimedia.org/r/1017873 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [14:20:42] (03CR) 10Ssingh: [C:03+2] cp3069: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1015971 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh) [14:20:44] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:21:25] (SystemdUnitFailed) firing: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:32] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:23:28] (JobUnavailable) firing: (3) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:24:28] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp3069.esams.wmnet with OS bullseye [14:24:32] (03CR) 10Filippo Giunchedi: "LGTM, see inline for nitpick" [puppet] - 10https://gerrit.wikimedia.org/r/1017873 
(https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [14:24:33] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9697733 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp3069.esams.wmnet with OS b... [14:25:44] (03PS2) 10Herron: thanos-rule: stop overriding site and prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/1017873 (https://phabricator.wikimedia.org/T359879) [14:27:20] (03CR) 10Filippo Giunchedi: "LGTM, see inline though for syntax fix" [puppet] - 10https://gerrit.wikimedia.org/r/1017873 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [14:28:23] (03PS3) 10Herron: thanos-rule: stop overriding site and prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/1017873 (https://phabricator.wikimedia.org/T359879) [14:29:40] mmhh titan1002 isn't happy, I'll take a look [14:30:19] 06SRE, 10MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9697779 (10joanna_borun) [14:30:19] (03CR) 10Herron: thanos-rule: stop overriding site and prometheus labels (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1017873 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [14:31:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P59851 and previous config saved to /var/cache/conftool/dbconfig/20240408-143149-arnaudb.json [14:32:12] 10SRE-tools, 10Cloud-VPS, 06Infrastructure-Foundations, 10Spicerack: spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; e... 
- https://phabricator.wikimedia.org/T361218#9697786 [14:32:38] 07sre-alert-triage, 06Infrastructure-Foundations: 14Alert in need of triage: SystemdUnitFailed (instance puppetdb2003:9100) - 14https://phabricator.wikimedia.org/T361578#9697791 (10MoritzMuehlenhoff) →14Duplicate dup:03T355924 [14:33:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2113', diff saved to https://phabricator.wikimedia.org/P59852 and previous config saved to /var/cache/conftool/dbconfig/20240408-143301-arnaudb.json [14:33:04] (03CR) 10Jelto: [V:03+1] "lgtm, we could also look at the backup lock." [puppet] - 10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [14:33:26] (03CR) 10Jelto: [V:03+1 C:03+1] [gitlab] Add shell script to replace rsync bare commands [puppet] - 10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [14:33:30] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9697795 (10joanna_borun) p:05Triage→03Medium a:03CDanis [14:34:33] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9697819 (10ssingh) `cp3069` also did PXE boot successfully, in the first attempt, so it makes it the fourth host in esams to not have an... 
[14:36:47] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9697825 (10Volans) p:05Triage→03Medium a:03Volans [14:36:54] 06SRE, 10ChangeProp, 06collaboration-services, 10GitLab, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9697832 (10joanna_borun) p:05Triage→03Medium [14:37:58] !log bounce thanos-query and thanos-store on titan1002 - stuck on high CPU [14:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:28] (JobUnavailable) firing: (4) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:44] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:38:47] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate swift_codfw is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:39:33] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:39:37] 06SRE, 10SRE-tools, 06Discovery-Search: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#9697839 (10joanna_borun) [14:40:31] (03CR) 10Filippo Giunchedi: [C:03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/1017873 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [14:40:34] !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:41:07] !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:41:25] (03CR) 10Herron: [C:03+2] thanos-rule: stop overriding site and prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/1017873 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [14:41:58] (CertAlmostExpired) firing: (2) Certificate for service swift-https:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:42:07] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:42:45] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:42:52] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:43:15] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:43:25] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:44:00] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:46:51] Lucas_WMDE: Thanks for the information about that. [14:46:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T360332)', diff saved to https://phabricator.wikimedia.org/P59853 and previous config saved to /var/cache/conftool/dbconfig/20240408-144657-arnaudb.json [14:47:00] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:47:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [14:47:01] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:47:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [14:47:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:47:23] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:47:30] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:47:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:47:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T360332)', diff saved to https://phabricator.wikimedia.org/P59854 and previous config saved to /var/cache/conftool/dbconfig/20240408-144738-arnaudb.json [14:47:48] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:48:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2113 (T360332)', diff saved to https://phabricator.wikimedia.org/P59855 and previous config saved to /var/cache/conftool/dbconfig/20240408-144808-arnaudb.json [14:48:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [14:48:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [14:48:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:48:37] 10SRE-tools, 06Infrastructure-Foundations, 10netbox: 14netbox dumps: fix permissions and timestamp - 14https://phabricator.wikimedia.org/T260077#9697877 (10Volans) 05Open→03Resolved a:03Volans 14Since the last update we've removed the Netbox CSV dumps all-together. Resolving [14:48:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:48:45] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3069.esams.wmnet with reason: host reimage [14:48:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T360332)', diff saved to https://phabricator.wikimedia.org/P59856 and previous config saved to /var/cache/conftool/dbconfig/20240408-144847-arnaudb.json [14:51:08] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: IDP failover improvments - https://phabricator.wikimedia.org/T268217#9697885 (10MoritzMuehlenhoff) p:05Medium→03Low [14:51:25] 06SRE, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802#9697900 (10joanna_borun) p:05Medium→03Low [14:51:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3069.esams.wmnet with reason: host reimage [14:52:06] 
!log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T360332)', diff saved to https://phabricator.wikimedia.org/P59857 and previous config saved to /var/cache/conftool/dbconfig/20240408-145205-arnaudb.json [14:52:08] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:52:15] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9697903 (10VRiley-WMF) Dell has suggested the following Full reboot (Completed with no change) Flea power drain reboot (Completed with no change) Reseat all cables with another flea power drain reboo... [14:52:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T360332)', diff saved to https://phabricator.wikimedia.org/P59858 and previous config saved to /var/cache/conftool/dbconfig/20240408-145256-arnaudb.json [14:53:41] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Create a PDU spicerack module - https://phabricator.wikimedia.org/T263018#9697915 (10joanna_borun) p:05Medium→03Low [14:54:25] (03PS6) 10Btullis: Create a new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) [14:54:25] (03PS5) 10Btullis: Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) [14:54:25] (03PS5) 10Btullis: Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) [14:54:26] (03PS5) 10Btullis: Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) [14:54:30] (03PS5) 10Btullis: Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531) [14:54:34] (03PS5) 10Btullis: Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) [14:54:38] (03PS5) 10Btullis: Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) [14:54:42] (03PS5) 10Btullis: Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531) [14:54:46] (03PS6) 10Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) [14:55:20] 10SRE-tools, 06Infrastructure-Foundations: Netbox accounting report: make it more reliable - https://phabricator.wikimedia.org/T260325#9697927 (10joanna_borun) p:05Medium→03Low [14:55:26] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Generate ssh_known_hosts for network devices - https://phabricator.wikimedia.org/T252747#9697928 (10ayounsi) [14:55:57] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:21] 06SRE, 10SRE-tools, 10Ganeti, 06Infrastructure-Foundations: Cookbook to failover the Ganeti master - https://phabricator.wikimedia.org/T283320#9697930 (10MoritzMuehlenhoff) p:05Medium→03Low [14:56:26] (03CR) 10Btullis: Create a new aqs-http-gateway chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [14:57:32] (ProbeDown) 
firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:57:42] mmhh I suspect we're back to thanos heavy queries, I'll take a look on titan hosts [14:58:15] 10SRE-tools, 06Infrastructure-Foundations: Debmonitor: backend-changeable settings are stored in the browser's session storage - https://phabricator.wikimedia.org/T240457#9697956 (10joanna_borun) p:05Medium→03Low [14:58:21] volans urandom cwhite ^ [14:58:28] (JobUnavailable) firing: (4) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:13] ack'd [15:00:57] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:03] none of the thanos-fe hosts in eqiad look like they are suffering from a load perspective [15:02:09] 10ops-eqiad, 06SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9697981 (10Jclark-ctr) a:03Jclark-ctr [15:02:24] yeah I doubt thanos-fe would be affected, titan1* hosts were [15:02:25] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9697983 (10Marostegui) Thank you! So far the host isn't available for ssh - maybe the whole filesystem is corrupted anyway as it happened here T359940. @VRiley-WMF let me know when I can proceed and re... 
[15:02:32] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:02] (03CR) 10Elukey: [V:03+1 C:03+2] role::aqs: rollout new Cassandra Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1017868 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:03:04] ahh yep I just connected the dots via the runbook to the titan hosts [15:03:21] 10ops-eqiad, 06SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9697991 (10Jclark-ctr) Warranty Expired 19 NOV 2023. Will look to see what drives we have available at data center [15:03:28] (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:33] yeah I'll make that more explicit in the runbooks [15:03:45] godog: did you do any action or did they self-recover as the docs say? [15:03:55] !log dancy@deploy1002 Started deploy [restbase/deploy@c4d19d7]: testing T361608 [15:03:58] T361608: RESTBase scap deployment failed - https://phabricator.wikimedia.org/T361608 [15:04:18] !log kill -9 thanos-store on titan1001 [15:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:27] cwhite: basically ^, though titan1002 recovered by itself [15:04:45] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops: Switchdc RO/RW: add check to test it editing a real wiki - https://phabricator.wikimedia.org/T163365#9697994 (10joanna_borun) [15:05:25] godog: right on - do you mind if I add that to the runbook? 
[15:05:38] thanos-query on titan1001 looks unhealthy according to systemctl status, anything still in progress there? [15:06:37] 06SRE, 06collaboration-services, 10Znuny: OTRS spam classification methods and systems - https://phabricator.wikimedia.org/T146968#9698000 (10joanna_borun) [15:07:01] cwhite: thank you, I'll be fixing up the runbooks early tomorrow [15:07:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P59859 and previous config saved to /var/cache/conftool/dbconfig/20240408-150713-arnaudb.json [15:07:23] herron: yeah I just systemctl restart thanos-query just now [15:07:31] should be happier [15:07:47] godog: great thanks [15:08:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P59860 and previous config saved to /var/cache/conftool/dbconfig/20240408-150803-arnaudb.json [15:08:14] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: 14wmf-auto-reimage: 'execution expired' on first puppet run - 14https://phabricator.wikimedia.org/T201317#9698001 (10Volans) 05Open→03Declined 14Too long has passed since then and doesn't seem to happen anymore. 
[15:08:28] (JobUnavailable) resolved: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:56] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:10:16] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:10:18] 06SRE, 10DNS, 06Infrastructure-Foundations, 10Mail: 14Set up role accounts and feedback loops (FBL) with all providers - 14https://phabricator.wikimedia.org/T106664#9698010 (10joanna_borun) 05Open→03Invalid [15:10:26] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:10:29] 06SRE, 06Infrastructure-Foundations, 10Mail: 14Get mail relay out of Yahoo! blacklist: apply to Yahoo for whitelisting bulk mail - 14https://phabricator.wikimedia.org/T58414#9698012 (10joanna_borun) [15:10:45] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:10:49] (PuppetFailure) resolved: Puppet has failed on mx-out2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:10:56] !log drain and restart cassandra-a on aqs1011 to test the new truststore [15:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:58] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:12:06] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:12:07] (03CR) 10Btullis: [C:03+2] Prepare dumpsdata100[1-2] for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1017855 (https://phabricator.wikimedia.org/T353787) (owner: 10Btullis) 
[15:14:19] (03CR) 10Brouberol: Create a new aqs-http-gateway chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [15:15:35] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Deploy new Truststore - elukey@cumin1002 [15:15:42] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3069.esams.wmnet with OS bullseye [15:15:47] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9698041 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp3069.esams.wmnet with OS bulls... [15:17:54] !log dancy@deploy1002 Finished deploy [restbase/deploy@c4d19d7]: testing T361608 (duration: 13m 59s) [15:18:00] T361608: RESTBase scap deployment failed - https://phabricator.wikimedia.org/T361608 [15:20:21] !log Uploaded golang-gitlab-wikimedia-sre-qemutest-dev 0.1.0 to apt.wm.o (bookworm) [15:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:52] (03PS7) 10Btullis: Create a new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) [15:20:52] (03PS6) 10Btullis: Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) [15:20:52] (03PS6) 10Btullis: Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) [15:20:53] (03PS6) 10Btullis: Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) [15:20:54] (03PS6) 10Btullis: 
Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531) [15:20:58] (03PS6) 10Btullis: Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) [15:21:02] (03PS6) 10Btullis: Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) [15:21:06] (03PS6) 10Btullis: Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531) [15:21:10] (03PS7) 10Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) [15:21:18] (03CR) 10Btullis: Create a new aqs-http-gateway chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [15:22:20] (03CR) 10Brouberol: [C:03+1] ":shipit:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [15:22:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P59862 and previous config saved to /var/cache/conftool/dbconfig/20240408-152221-arnaudb.json [15:23:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P59863 and previous config saved to /var/cache/conftool/dbconfig/20240408-152311-arnaudb.json [15:23:28] 06SRE, 06Infrastructure-Foundations, 10Mail: 14Do not apply spam headers on email assessed NOT to be spam - 14https://phabricator.wikimedia.org/T111595#9698089 
(10jhathaway) 05Open→03Declined 14@bcampbell setting this to declined, please reopen, if this is still a concern [15:25:52] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts dumpsdata1001.eqiad.wmnet [15:28:36] (03CR) 10Eevans: "Is there an opportunity here to change the name to something that doesn't contain "http-gateway"? I assume the precedent here was _cassand" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [15:30:05] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240408T1530). [15:31:36] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [15:37:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T360332)', diff saved to https://phabricator.wikimedia.org/P59864 and previous config saved to /var/cache/conftool/dbconfig/20240408-153730-arnaudb.json [15:37:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:37:34] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:37:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:37:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T360332)', diff saved to https://phabricator.wikimedia.org/P59865 and previous config saved to /var/cache/conftool/dbconfig/20240408-153753-arnaudb.json [15:38:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T360332)', diff saved to https://phabricator.wikimedia.org/P59866 and previous config saved to /var/cache/conftool/dbconfig/20240408-153819-arnaudb.json [15:38:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1241.eqiad.wmnet with reason: 
Maintenance [15:38:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance [15:38:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T360332)', diff saved to https://phabricator.wikimedia.org/P59867 and previous config saved to /var/cache/conftool/dbconfig/20240408-153842-arnaudb.json [15:38:58] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dumpsdata1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [15:39:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3069.esams.wmnet,service=(cdn|ats-be) [15:39:26] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9698195 (10ssingh) [15:40:37] 06SRE, 10Znuny: OTRS spam classification methods and systems - https://phabricator.wikimedia.org/T146968#9698200 (10LSobanski) [15:40:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T360332)', diff saved to https://phabricator.wikimedia.org/P59868 and previous config saved to /var/cache/conftool/dbconfig/20240408-154059-arnaudb.json [15:41:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T360332)', diff saved to https://phabricator.wikimedia.org/P59869 and previous config saved to /var/cache/conftool/dbconfig/20240408-154110-arnaudb.json [15:42:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dumpsdata1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [15:42:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:42:35] !log btullis@cumin1002 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts dumpsdata1001.eqiad.wmnet [15:43:57] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts dumpsdata1002.eqiad.wmnet [15:45:00] 10ops-eqiad, 10decommission-hardware: decommission dumpsdata1001.eqiad.wmnet - https://phabricator.wikimedia.org/T362064#9698249 (10BTullis) a:05BTullis→03None [15:47:30] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:48:22] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:49:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:54:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:56:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P59870 and previous config saved to /var/cache/conftool/dbconfig/20240408-155606-arnaudb.json [15:56:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P59871 and previous config saved to /var/cache/conftool/dbconfig/20240408-155618-arnaudb.json [15:57:12] (03CR) 10Btullis: "Thanks Eric. I'm all for reducing confusion, but have you got any suggestions?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [15:58:29] 10ops-eqiad, 10decommission-hardware: decommission dumpsdata1002.eqiad.wmnet - https://phabricator.wikimedia.org/T362065#9698304 (10BTullis) a:05BTullis→03None [16:01:08] 06SRE, 06serviceops, 07Epic: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9698325 (10akosiaris) I 'll finish parsoid and testreduce in T359387 [16:01:45] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [16:04:35] 06SRE, 10ChangeProp, 06collaboration-services, 10GitLab, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9698336 (10CodeReviewBot) bd808 merged https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/28 Repla... [16:06:54] !log elukey@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:aqs-codfw: Deploy new Truststore - elukey@cumin1002 [16:07:06] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dumpsdata1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [16:11:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P59872 and previous config saved to /var/cache/conftool/dbconfig/20240408-161114-arnaudb.json [16:11:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P59873 and previous config saved to /var/cache/conftool/dbconfig/20240408-161125-arnaudb.json [16:13:23] (03PS1) 10Elukey: profile::prometheus::analytics: remove old metric relabeling [puppet] - 10https://gerrit.wikimedia.org/r/1017882 [16:14:49] !log btullis@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dumpsdata1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [16:14:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:50] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dumpsdata1002.eqiad.wmnet [16:14:51] !log manually drain + restart cassandra-a on aqs2007 - cookbook failed [16:14:59] 10ops-eqiad, 10decommission-hardware: decommission dumpsdata1002.eqiad.wmnet - https://phabricator.wikimedia.org/T362065#9698379 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1002 for hosts: `dumpsdata1002.eqiad.wmnet` - dumpsdata1002.eqiad.wmnet (**PASS**) - Downtimed host... [16:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:08] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[08-12]*: Deploy new Truststore - elukey@cumin1002 [16:26:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T360332)', diff saved to https://phabricator.wikimedia.org/P59874 and previous config saved to /var/cache/conftool/dbconfig/20240408-162621-arnaudb.json [16:26:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance [16:26:30] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [16:26:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T360332)', diff saved to https://phabricator.wikimedia.org/P59875 and previous config saved to /var/cache/conftool/dbconfig/20240408-162633-arnaudb.json [16:26:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [16:26:38] !log arnaudb@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance [16:26:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T360332)', diff saved to https://phabricator.wikimedia.org/P59876 and previous config saved to /var/cache/conftool/dbconfig/20240408-162645-arnaudb.json [16:26:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [16:26:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T360332)', diff saved to https://phabricator.wikimedia.org/P59877 and previous config saved to /var/cache/conftool/dbconfig/20240408-162655-arnaudb.json [16:29:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T360332)', diff saved to https://phabricator.wikimedia.org/P59878 and previous config saved to /var/cache/conftool/dbconfig/20240408-162902-arnaudb.json [16:29:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T360332)', diff saved to https://phabricator.wikimedia.org/P59879 and previous config saved to /var/cache/conftool/dbconfig/20240408-162916-arnaudb.json [16:32:13] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:32:18] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:32:29] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:32:37] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:33:08] 10SRE-swift-storage: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844#9698437 (10elukey) @MatthewVernon we could do something like the following: * clean the cert in the Puppet CA - at the point the cert is revoked in the Puppet CA but clients will not check it, sin... 
[16:39:15] (03CR) 10Eevans: [C:03+1] profile::prometheus::analytics: remove old metric relabeling [puppet] - 10https://gerrit.wikimedia.org/r/1017882 (owner: 10Elukey) [16:44:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P59880 and previous config saved to /var/cache/conftool/dbconfig/20240408-164410-arnaudb.json [16:44:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P59881 and previous config saved to /var/cache/conftool/dbconfig/20240408-164424-arnaudb.json [16:45:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:45:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:45:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T356166)', diff saved to https://phabricator.wikimedia.org/P59882 and previous config saved to /var/cache/conftool/dbconfig/20240408-164524-marostegui.json [16:45:28] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [16:49:27] (03CR) 10Ssingh: "Some comments in-line. Let's talk about running PCC offline." [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [16:50:06] (03PS2) 10Elukey: profile::prometheus::analytics: remove old metric relabeling [puppet] - 10https://gerrit.wikimedia.org/r/1017882 [16:50:07] (03CR) 10Ssingh: purged: add PKI cert handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [16:56:39] (03CR) 10Btullis: [C:03+1] "Looks good, thanks elukey." 
[puppet] - 10https://gerrit.wikimedia.org/r/1017882 (owner: 10Elukey) [16:57:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs20[08-12]*: Deploy new Truststore - elukey@cumin1002 [16:59:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P59883 and previous config saved to /var/cache/conftool/dbconfig/20240408-165917-arnaudb.json [16:59:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P59884 and previous config saved to /var/cache/conftool/dbconfig/20240408-165931-arnaudb.json [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240408T1700) [17:00:04] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240408T1700). 
[17:14:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T360332)', diff saved to https://phabricator.wikimedia.org/P59885 and previous config saved to /var/cache/conftool/dbconfig/20240408-171425-arnaudb.json [17:14:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance [17:14:29] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:14:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T360332)', diff saved to https://phabricator.wikimedia.org/P59886 and previous config saved to /var/cache/conftool/dbconfig/20240408-171439-arnaudb.json [17:14:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance [17:14:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [17:14:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T360332)', diff saved to https://phabricator.wikimedia.org/P59887 and previous config saved to /var/cache/conftool/dbconfig/20240408-171448-arnaudb.json [17:14:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [17:15:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T360332)', diff saved to https://phabricator.wikimedia.org/P59888 and previous config saved to /var/cache/conftool/dbconfig/20240408-171502-arnaudb.json [17:17:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T360332)', diff saved to https://phabricator.wikimedia.org/P59889 and previous config saved to /var/cache/conftool/dbconfig/20240408-171707-arnaudb.json [17:18:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P59890 and previous config saved to /var/cache/conftool/dbconfig/20240408-171819-arnaudb.json [17:29:09] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9698597 (10mforns) [17:32:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P59891 and previous config saved to /var/cache/conftool/dbconfig/20240408-173215-arnaudb.json [17:33:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P59892 and previous config saved to /var/cache/conftool/dbconfig/20240408-173327-arnaudb.json [17:34:01] (03PS1) 10Fabfur: haproxy: remove timestamp from unique-id-format [puppet] - 10https://gerrit.wikimedia.org/r/1017913 (https://phabricator.wikimedia.org/T351117) [17:37:03] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9698613 (10VRiley-WMF) Hi @Marostegui Thanks! I just completed finishing up everything with Dell. Some other troubleshooting we did was Strip the server down to bare bones (1 DIMM, 1 CPU, 1PSU) and... [17:38:53] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1017913 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur) [17:39:04] (03CR) 10Gmodena: haproxy: remove timestamp from unique-id-format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017913 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur) [17:43:20] (03CR) 10CDanis: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1017311 (https://phabricator.wikimedia.org/T353705) (owner: 10Elukey) [17:46:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:47:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P59893 and previous config saved to /var/cache/conftool/dbconfig/20240408-174723-arnaudb.json [17:48:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P59894 and previous config saved to /var/cache/conftool/dbconfig/20240408-174835-arnaudb.json [18:01:39] (03PS1) 10Bartosz Dziewoński: Revert "Mark all autoreviewed edits in PageSaveComplete hook" [extensions/FlaggedRevs] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1017890 (https://phabricator.wikimedia.org/T361918) [18:02:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T360332)', diff saved to https://phabricator.wikimedia.org/P59895 and previous config saved to /var/cache/conftool/dbconfig/20240408-180231-arnaudb.json [18:02:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [18:02:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [18:02:47] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [18:02:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T360332)', diff saved to https://phabricator.wikimedia.org/P59896 and previous config saved to /var/cache/conftool/dbconfig/20240408-180253-arnaudb.json [18:03:43] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T360332)', diff saved to https://phabricator.wikimedia.org/P59897 and previous config saved to /var/cache/conftool/dbconfig/20240408-180343-arnaudb.json [18:03:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [18:03:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [18:04:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T360332)', diff saved to https://phabricator.wikimedia.org/P59898 and previous config saved to /var/cache/conftool/dbconfig/20240408-180406-arnaudb.json [18:05:07] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission dumpsdata1001.eqiad.wmnet - https://phabricator.wikimedia.org/T362064#9698656 (10VRiley-WMF) a:03VRiley-WMF [18:07:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T360332)', diff saved to https://phabricator.wikimedia.org/P59899 and previous config saved to /var/cache/conftool/dbconfig/20240408-180724-arnaudb.json [18:13:29] (03PS1) 10Andrew Bogott: cinder backups: move real backup workloads to 200[34] [puppet] - 10https://gerrit.wikimedia.org/r/1017915 (https://phabricator.wikimedia.org/T356216) [18:13:30] (03PS1) 10Andrew Bogott: Prepare cloudbackup200[12] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1017916 (https://phabricator.wikimedia.org/T356216) [18:14:03] (03CR) 10Andrew Bogott: [C:03+2] cinder backups: move real backup workloads to 200[34] [puppet] - 10https://gerrit.wikimedia.org/r/1017915 (https://phabricator.wikimedia.org/T356216) (owner: 10Andrew Bogott) [18:14:41] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission dumpsdata1001.eqiad.wmnet - https://phabricator.wikimedia.org/T362064#9698683 (10VRiley-WMF) [18:15:08] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission dumpsdata1001.eqiad.wmnet - 
https://phabricator.wikimedia.org/T362064#9698695 (10VRiley-WMF) Completed decommission of this device. [18:15:21] 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission dumpsdata1001.eqiad.wmnet - 14https://phabricator.wikimedia.org/T362064#9698696 (10VRiley-WMF) 05Open→03Resolved [18:21:25] (SystemdUnitFailed) firing: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:22:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P59900 and previous config saved to /var/cache/conftool/dbconfig/20240408-182232-arnaudb.json [18:37:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P59901 and previous config saved to /var/cache/conftool/dbconfig/20240408-183739-arnaudb.json [18:38:47] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate swift_codfw is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:41:58] (CertAlmostExpired) firing: (2) Certificate for service swift-https:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:52:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T360332)', diff saved to https://phabricator.wikimedia.org/P59902 and previous config saved to /var/cache/conftool/dbconfig/20240408-185247-arnaudb.json [18:52:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance [18:52:51] T360332: Make the cupe_actor column nullable on WMF wikis - 
https://phabricator.wikimedia.org/T360332 [18:53:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance [18:53:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T360332)', diff saved to https://phabricator.wikimedia.org/P59903 and previous config saved to /var/cache/conftool/dbconfig/20240408-185309-arnaudb.json [18:55:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T360332)', diff saved to https://phabricator.wikimedia.org/P59904 and previous config saved to /var/cache/conftool/dbconfig/20240408-185528-arnaudb.json [19:01:18] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9698788 (10Scott_French) Thanks, @mforns! > Does this binary need to implement some of the endpoint... [19:03:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T360332)', diff saved to https://phabricator.wikimedia.org/P59905 and previous config saved to /var/cache/conftool/dbconfig/20240408-190319-arnaudb.json [19:03:26] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:08:38] (03PS16) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [19:10:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P59907 and previous config saved to /var/cache/conftool/dbconfig/20240408-191035-arnaudb.json [19:12:10] 10ops-codfw, 06SRE, 10decommission-hardware: decommission elastic20[37-54].codfw.wmnet - https://phabricator.wikimedia.org/T361305#9698811 (10bking) Corrected settings by POSTing [[ https://phabricator.wikimedia.org/P59906 | this 
json file ]] : `curl -vH "content-type: application/json" -XPUT https://search... [19:18:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P59908 and previous config saved to /var/cache/conftool/dbconfig/20240408-191828-arnaudb.json [19:22:03] (03CR) 10Ebernhardson: Add Flink alerts for Cirrus Streaming Updater (035 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:25:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P59909 and previous config saved to /var/cache/conftool/dbconfig/20240408-192543-arnaudb.json [19:31:50] 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362106 (10bking) 03NEW [19:32:11] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362106#9698856 (10bking) [19:33:05] 06SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 3x VM for new opensearch cluster - https://phabricator.wikimedia.org/T362107 (10bking) 03NEW [19:33:20] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 3x VM for new opensearch cluster - https://phabricator.wikimedia.org/T362107#9698867 (10bking) [19:33:35] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 3x VM for new opensearch cluster - https://phabricator.wikimedia.org/T362107#9698868 (10bking) [19:33:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P59910 and previous config saved to /var/cache/conftool/dbconfig/20240408-193336-arnaudb.json [19:33:45] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 3x VM 
request for new opensearch cluster - https://phabricator.wikimedia.org/T362106#9698870 (10bking) [19:40:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T360332)', diff saved to https://phabricator.wikimedia.org/P59911 and previous config saved to /var/cache/conftool/dbconfig/20240408-194050-arnaudb.json [19:40:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [19:40:54] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:41:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [19:41:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T360332)', diff saved to https://phabricator.wikimedia.org/P59912 and previous config saved to /var/cache/conftool/dbconfig/20240408-194113-arnaudb.json [19:44:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T360332)', diff saved to https://phabricator.wikimedia.org/P59913 and previous config saved to /var/cache/conftool/dbconfig/20240408-194432-arnaudb.json [19:46:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.46% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:48:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T360332)', diff saved to https://phabricator.wikimedia.org/P59914 and previous config saved to /var/cache/conftool/dbconfig/20240408-194843-arnaudb.json [19:48:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [19:48:47] T360332: Make the 
cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:48:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [19:48:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance [19:49:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance [19:49:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T360332)', diff saved to https://phabricator.wikimedia.org/P59915 and previous config saved to /var/cache/conftool/dbconfig/20240408-194919-arnaudb.json [19:51:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:51:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T360332)', diff saved to https://phabricator.wikimedia.org/P59916 and previous config saved to /var/cache/conftool/dbconfig/20240408-195138-arnaudb.json [19:53:41] (03PS1) 10Ebernhardson: Remove buffer on cirrussearch-request log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017929 (https://phabricator.wikimedia.org/T359580) [19:57:07] 06SRE, 06serviceops, 07Epic: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9698910 (10Scott_French) I was planning to migrate etcd to PKI as part of T350565, but can explore this earlier if needed. 
[19:59:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:59:34] (03PS2) 10Ebernhardson: Remove buffer on cirrussearch-request log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017929 (https://phabricator.wikimedia.org/T359580) [19:59:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P59917 and previous config saved to /var/cache/conftool/dbconfig/20240408-195940-arnaudb.json [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240408T2000). [20:00:04] MatmaRex and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:08] \o [20:00:27] hi [20:00:41] I can deploy today [20:01:19] (03CR) 10Urbanecm: [C:03+2] Revert "Mark all autoreviewed edits in PageSaveComplete hook" [extensions/FlaggedRevs] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1017890 (https://phabricator.wikimedia.org/T361918) (owner: 10Bartosz Dziewoński) [20:04:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:06:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P59918 and previous config saved to /var/cache/conftool/dbconfig/20240408-200645-arnaudb.json [20:08:43] (03Merged) 10jenkins-bot: Revert "Mark all autoreviewed edits in PageSaveComplete hook" [extensions/FlaggedRevs] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1017890 (https://phabricator.wikimedia.org/T361918) (owner: 10Bartosz Dziewoński) [20:09:24] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1017890|Revert "Mark all autoreviewed edits in PageSaveComplete hook" (T361918 T361940 T361960)]] [20:09:32] T361918: "Reverted" tag no longer applied in reviewable namespaces on wikis with FlaggedRevs since 2024-04-04 - https://phabricator.wikimedia.org/T361918 [20:09:32] T361940: Imports, moves and protections are not autoreviewed - https://phabricator.wikimedia.org/T361940 [20:09:32] T361960: The page gets unreviewed when moved to a new address - https://phabricator.wikimedia.org/T361960 [20:11:32] hmm, do we have a test wiki with flaggedrevs? 
[20:11:36] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:1017890|Revert "Mark all autoreviewed edits in PageSaveComplete hook" (T361918 T361940 T361960)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:43] MatmaRex: is test2.wikipedia.org good enough? [20:12:06] oh yeah, it should be [20:12:19] and yes, please test :)). [20:14:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P59919 and previous config saved to /var/cache/conftool/dbconfig/20240408-201447-arnaudb.json [20:14:55] looks good [20:15:13] !log urbanecm@deploy1002 urbanecm and matmarex: Continuing with sync [20:15:15] proceeding, ty [20:18:27] (03PS4) 10Peter Fischer: cirrus: Tune resource usage of consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016858 (owner: 10Ebernhardson) [20:19:21] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 3x VM for new opensearch cluster - https://phabricator.wikimedia.org/T362107#9698960 (10bking) [20:19:41] (03CR) 10Peter Fischer: [C:03+2] cirrus: Tune resource usage of consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016858 (owner: 10Ebernhardson) [20:20:40] (03Merged) 10jenkins-bot: cirrus: Tune resource usage of consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016858 (owner: 10Ebernhardson) [20:20:49] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362106#9698963 (10bking) [20:21:25] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', 
diff saved to https://phabricator.wikimedia.org/P59920 and previous config saved to /var/cache/conftool/dbconfig/20240408-202153-arnaudb.json [20:23:53] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:24:09] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:25:26] scap is still scaping [20:25:49] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1017890|Revert "Mark all autoreviewed edits in PageSaveComplete hook" (T361918 T361940 T361960)]] (duration: 16m 25s) [20:25:53] finally [20:25:54] T361918: "Reverted" tag no longer applied in reviewable namespaces on wikis with FlaggedRevs since 2024-04-04 - https://phabricator.wikimedia.org/T361918 [20:25:54] T361940: Imports, moves and protections are not autoreviewed - https://phabricator.wikimedia.org/T361940 [20:25:54] T361960: The page gets unreviewed when moved to a new address - https://phabricator.wikimedia.org/T361960 [20:25:57] ebernhardson: over to you :) [20:26:07] (03PS1) 10JHathaway: dev: update bastion cidrs [puppet] - 10https://gerrit.wikimedia.org/r/1017934 [20:27:04] urbanecm: awesome [20:29:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T360332)', diff saved to https://phabricator.wikimedia.org/P59921 and previous config saved to /var/cache/conftool/dbconfig/20240408-202955-arnaudb.json [20:29:59] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [20:31:07] (03CR) 10Ebernhardson: [C:03+2] "backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017929 (https://phabricator.wikimedia.org/T359580) (owner: 10Ebernhardson) [20:31:50] (03Merged) 10jenkins-bot: Remove buffer on cirrussearch-request log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017929 (https://phabricator.wikimedia.org/T359580) (owner: 10Ebernhardson) [20:32:19] (03CR) 10TrainBranchBot: [C:03+2] 
"Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015393 (owner: 10Ebernhardson) [20:33:03] (03Merged) 10jenkins-bot: cirrus: Restore traffic to codfw clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015393 (owner: 10Ebernhardson) [20:35:52] (03PS2) 10JHathaway: dev: update bastion cidrs [puppet] - 10https://gerrit.wikimedia.org/r/1017934 (https://phabricator.wikimedia.org/T362109) [20:37:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T360332)', diff saved to https://phabricator.wikimedia.org/P59922 and previous config saved to /var/cache/conftool/dbconfig/20240408-203700-arnaudb.json [20:37:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance [20:37:04] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [20:37:12] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:1015393|cirrus: Restore traffic to codfw clusters]] [20:37:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance [20:37:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T360332)', diff saved to https://phabricator.wikimedia.org/P59923 and previous config saved to /var/cache/conftool/dbconfig/20240408-203723-arnaudb.json [20:39:27] !log ebernhardson@deploy1002 ebernhardson: Backport for [[gerrit:1015393|cirrus: Restore traffic to codfw clusters]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:40:18] (03CR) 10JHathaway: [C:03+2] dev: update bastion cidrs [puppet] - 10https://gerrit.wikimedia.org/r/1017934 (https://phabricator.wikimedia.org/T362109) (owner: 10JHathaway) [20:40:27] (03CR) 10Dzahn: stewards: let puppet create /srv/exports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016439 
(https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [20:40:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T360332)', diff saved to https://phabricator.wikimedia.org/P59924 and previous config saved to /var/cache/conftool/dbconfig/20240408-204041-arnaudb.json [20:40:58] (03PS1) 10Eevans: sessionstore enable TLS verification in staging for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017935 (https://phabricator.wikimedia.org/T352647) [20:42:16] !log ebernhardson@deploy1002 ebernhardson: Continuing with sync [20:42:18] (03PS2) 10Eevans: sessionstore configure TLS verification in staging for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017935 (https://phabricator.wikimedia.org/T352647) [20:43:50] (03PS3) 10Dzahn: stewards: let puppet create /srv/exports [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) [20:52:37] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:1015393|cirrus: Restore traffic to codfw clusters]] (duration: 15m 25s) [20:53:02] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:53:10] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:53:18] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:53:21] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113 (10SToyofuku-WMF) 03NEW [20:53:57] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:54:23] (03CR) 10Urbanecm: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [20:55:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to 
https://phabricator.wikimedia.org/P59925 and previous config saved to /var/cache/conftool/dbconfig/20240408-205548-arnaudb.json [21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240408T2100). [21:00:31] (03PS4) 10Dzahn: stewards: let puppet create /srv/exports [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) [21:00:31] (03PS4) 10Dzahn: stewards: puppetize steward-onboarder config file and paths [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) [21:01:22] (03CR) 10CI reject: [V:04-1] stewards: puppetize steward-onboarder config file and paths [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:01:38] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:01:51] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:01:57] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:02:13] (03CR) 10Dzahn: [C:04-1] "merged into a single change ( I67cad69044d0306de) kind of by accident during rebase" [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:02:47] (03Abandoned) 10Dzahn: stewards: let puppet create /srv/exports [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:07:33] (03PS5) 10Dzahn: stewards: add config and export dirs, steward onboarder config [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) [21:08:01] (03CR) 10CI reject: [V:04-1] stewards: add config and export dirs, steward onboarder config [puppet] - 
10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:10:15] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:10:41] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:10:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P59926 and previous config saved to /var/cache/conftool/dbconfig/20240408-211056-arnaudb.json [21:12:47] (03PS1) 10Ryan Kemper: wdqs: decom wdqs1025 [puppet] - 10https://gerrit.wikimedia.org/r/1017937 (https://phabricator.wikimedia.org/T362080) [21:16:12] (03PS6) 10Dzahn: stewards: add config and export dirs, steward onboarder config [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) [21:16:31] (03CR) 10Bking: [C:03+1] wdqs: decom wdqs1025 [puppet] - 10https://gerrit.wikimedia.org/r/1017937 (https://phabricator.wikimedia.org/T362080) (owner: 10Ryan Kemper) [21:16:53] (03CR) 10Ryan Kemper: [C:03+2] wdqs: decom wdqs1025 [puppet] - 10https://gerrit.wikimedia.org/r/1017937 (https://phabricator.wikimedia.org/T362080) (owner: 10Ryan Kemper) [21:17:58] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:18:09] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:20:27] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts wdqs1025.eqiad.wmnet [21:21:17] (03CR) 10Muehlenhoff: stewards: add config and export dirs, steward onboarder config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:23:04] (03PS7) 10Dzahn: stewards: add config and export dirs, steward onboarder config [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) 
[21:23:10] (03CR) 10Dzahn: stewards: add config and export dirs, steward onboarder config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:24:10] (03CR) 10Urbanecm: stewards: add config and export dirs, steward onboarder config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:24:51] (03CR) 10Dzahn: stewards: add config and export dirs, steward onboarder config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:26:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T360332)', diff saved to https://phabricator.wikimedia.org/P59927 and previous config saved to /var/cache/conftool/dbconfig/20240408-212605-arnaudb.json [21:26:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance [21:26:09] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [21:26:12] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [21:26:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance [21:26:24] (03CR) 10Dzahn: "do we really do "shared = true" for the repo dir but also don't let the group write to it? 
hmmm" [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:26:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T360332)', diff saved to https://phabricator.wikimedia.org/P59928 and previous config saved to /var/cache/conftool/dbconfig/20240408-212628-arnaudb.json [21:27:23] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:27:30] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:28:29] mutante: i can definitely write to the repo somehow :D [21:28:35] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1025.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [21:29:05] urbanecm: I think the shared = true / group = .. does that [21:29:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T360332)', diff saved to https://phabricator.wikimedia.org/P59929 and previous config saved to /var/cache/conftool/dbconfig/20240408-212946-arnaudb.json [21:30:03] mutante: it just sets the umask for the `git clone` oneoff correctly :-/ . but when i run git pull from my own account, it uses the standard umask 022, so i'm creating unwriteable files :-/ [21:30:32] "fortunately" i am the only non-root user there, so it doesn't matter that much just yet. but it will at some point. [21:31:15] urbanecm: it feels like we have to decide who is pulling.. only puppet or users [21:31:52] yea. if i get things right, then puppet rn only does the clone, and then leaves things alone. 
[21:31:53] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9699166 (10NBaca-WMF) As @SToyofuku-WMF 's manager I approve this request [21:32:33] urbanecm: that's right, without "ensure latest" it will only clone it a single time on first run [21:33:01] so then we're kind of doing "only users" now it seems? [21:33:04] ensure latest from puppet is a bit controversial [21:33:32] but with that it would pull on every run [21:34:54] i don't think ensure => latest is a good idea here. once we provision actual credentials to the host, anyone with merge in gitlab would be able to convince the host to give them steward-level permissions by altering the code. [21:35:18] (03CR) 10Scott French: "Thanks for the review, Riccardo!" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017358 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [21:36:09] urbanecm: yea, that concern would be in line with https://phabricator.wikimedia.org/T218900 except that was Gerrit and not gitlab [21:36:19] exactly [21:36:31] should we be talking about that task in a public channel though? 
:D [21:40:25] well, the answer to the technical issue how to manage a git repo with multiple people [21:40:32] is how it was done on deploy* 8 years ago [21:40:52] and we should copy that but not use that global wikidev hack, but our local user group [21:40:58] (summary) [21:41:21] (03CR) 10Scott French: [C:03+2] Bump etcdmirror package version: 0.0.11 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017358 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [21:42:12] (03Merged) 10jenkins-bot: Bump etcdmirror package version: 0.0.11 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017358 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [21:42:52] (03CR) 10Urbanecm: "23:40 well, the answer to the technical issue how to manage a git repo with multiple people" [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:43:06] (03CR) 10Urbanecm: [C:03+1] stewards: add config and export dirs, steward onboarder config [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:44:22] (03CR) 10Scott French: [C:03+2] Release etcd-mirror 0.0.11 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017346 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [21:44:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P59930 and previous config saved to /var/cache/conftool/dbconfig/20240408-214454-arnaudb.json [21:45:14] (03CR) 10Dzahn: [C:03+2] stewards: add config and export dirs, steward onboarder config [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:46:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:48:12] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1025.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002"
[21:48:13] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:48:14] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs1025.eqiad.wmnet
[21:53:29] (03PS1) 10JHathaway: dev: point puppetdb to the dev puppet server [puppet] - 10https://gerrit.wikimedia.org/r/1017942 (https://phabricator.wikimedia.org/T346842)
[21:56:26] (03CR) 10JHathaway: [C:03+2] dev: point puppetdb to the dev puppet server [puppet] - 10https://gerrit.wikimedia.org/r/1017942 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway)
[22:00:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P59931 and previous config saved to /var/cache/conftool/dbconfig/20240408-220001-arnaudb.json
[22:04:52] (03Merged) 10jenkins-bot: Release etcd-mirror 0.0.11 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1017346 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French)
[22:15:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T360332)', diff saved to https://phabricator.wikimedia.org/P59933 and previous config saved to /var/cache/conftool/dbconfig/20240408-221509-arnaudb.json
[22:15:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[22:15:13] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[22:15:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[22:15:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance
[22:15:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance
[22:16:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance
[22:16:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance
[22:16:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2106 (T360332)', diff saved to https://phabricator.wikimedia.org/P59934 and previous config saved to /var/cache/conftool/dbconfig/20240408-221626-arnaudb.json
[22:18:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T360332)', diff saved to https://phabricator.wikimedia.org/P59935 and previous config saved to /var/cache/conftool/dbconfig/20240408-221847-arnaudb.json
[22:33:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P59937 and previous config saved to /var/cache/conftool/dbconfig/20240408-223355-arnaudb.json
[22:35:29] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9699261 (10Dzahn) There are a lot of alternative names on these certs that we should check individually. Then we should only create what...
[22:38:47] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate swift_codfw is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:39:33] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9699263 (10Dzahn) | kibana.certs.yaml | | name | DNS | comment |cas-logstash.wikimedia.org | | | |kibana-next.svc.codfw.wmnet | | | |kib...
[22:41:58] (CertAlmostExpired) firing: (2) Certificate for service swift-https:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:49:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P59938 and previous config saved to /var/cache/conftool/dbconfig/20240408-224903-arnaudb.json
[22:49:15] (03PS1) 10Dzahn: ssl: delete webperf.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/1017945 (https://phabricator.wikimedia.org/T360414)
[22:49:27] (03CR) 10Dzahn: [C:03+2] "follow-up https://gerrit.wikimedia.org/r/1017945" [puppet] - 10https://gerrit.wikimedia.org/r/538347 (owner: 10Dzahn)
[22:51:08] (03CR) 10Dzahn: "cleanup should also include private repo here:" [puppet] - 10https://gerrit.wikimedia.org/r/1017945 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn)
[22:53:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.156s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:58:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.156s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:01:24] 10ops-codfw, 10decommission-hardware: decommission wdqs1025.eqiad.wmnet - https://phabricator.wikimedia.org/T362122 (10RKemper) 03NEW
[23:01:45] 10ops-codfw, 10decommission-hardware: decommission wdqs1025.eqiad.wmnet - https://phabricator.wikimedia.org/T362122#9699294 (10RKemper)
[23:04:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T360332)', diff saved to https://phabricator.wikimedia.org/P59939 and previous config saved to /var/cache/conftool/dbconfig/20240408-230410-arnaudb.json
[23:04:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance
[23:04:19] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[23:04:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance
[23:04:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance
[23:04:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance
[23:05:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2119 (T360332)', diff saved to https://phabricator.wikimedia.org/P59940 and previous config saved to /var/cache/conftool/dbconfig/20240408-230502-arnaudb.json
[23:07:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T360332)', diff saved to https://phabricator.wikimedia.org/P59941 and previous config saved to /var/cache/conftool/dbconfig/20240408-230723-arnaudb.json
[23:09:07] (03CR) 10Tim Starling: [C:03+2] Switch block schema to read-new/write-new mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006181 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling)
[23:09:50] (03Merged) 10jenkins-bot: Switch block schema to read-new/write-new mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006181 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling)
[23:10:03] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9699308 (10Dzahn)
[23:22:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P59942 and previous config saved to /var/cache/conftool/dbconfig/20240408-232231-arnaudb.json
[23:25:33] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: stop writing to ipblocks table T355034 (duration: 12m 32s)
[23:25:37] T355034: Deploy new block_target schema - https://phabricator.wikimedia.org/T355034
[23:37:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P59943 and previous config saved to /var/cache/conftool/dbconfig/20240408-233739-arnaudb.json
[23:37:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1017459
[23:37:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1017459 (owner: 10TrainBranchBot)
[23:51:18] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you! 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1017945 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn)
[23:52:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T360332)', diff saved to https://phabricator.wikimedia.org/P59944 and previous config saved to /var/cache/conftool/dbconfig/20240408-235247-arnaudb.json
[23:52:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[23:52:51] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[23:53:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[23:53:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T360332)', diff saved to https://phabricator.wikimedia.org/P59945 and previous config saved to /var/cache/conftool/dbconfig/20240408-235309-arnaudb.json
[23:55:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T360332)', diff saved to https://phabricator.wikimedia.org/P59946 and previous config saved to /var/cache/conftool/dbconfig/20240408-235530-arnaudb.json
[23:55:31] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you! 😎" [puppet] - 10https://gerrit.wikimedia.org/r/1017806 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi)
[23:58:45] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1017459 (owner: 10TrainBranchBot)