[00:05:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:05:20] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:08:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:08:32] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:22:41] (SystemdUnitFailed) firing: (2) wmf_auto_restart_nginx.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:25:41] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:25:48] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:34:04] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:34:11] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:37:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:37:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:37:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1013358
[00:37:51] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1013358 (owner: TrainBranchBot)
[00:41:11] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:41:18] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:44:32] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:44:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:44:39] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:46:32] (PS1) Reedy: HTMLHiddenField: Support CodexHTMLForm [core] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1013258 (https://phabricator.wikimedia.org/T360717)
[01:00:20] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1013358 (owner: TrainBranchBot)
[01:04:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:31:38] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002
[01:31:43] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[01:36:43] (CR) Pppery: MachineVision extension is being sunsetted, so stop doing dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) (owner: Cparle)
[01:39:46] SRE, SRE-Access-Requests, collaboration-services, Gerrit-Privilege-Requests: Add dani to wmf-deployment - https://phabricator.wikimedia.org/T360521#9652614 (DDeSouza) Thanks!
[01:42:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 33.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:47:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 32.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:37:17] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:58:33] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:58:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:02:17] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:45] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:17:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:39] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:54:46] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:01:58] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:02:05] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:06:32] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:06:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:22:41] (SystemdUnitFailed) firing: (2) wmf_auto_restart_nginx.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:45:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:46:05] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:49:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:03:06] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:03:12] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:06:56] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:07:03] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:09:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:39:30] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:39:37] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240322T0600)
[06:10:49] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:10:56] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:20:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:20:10] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:22:19] !log T358882 Updating cross-cluster seeds to bring into concordance with newly added masters: `ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9643/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst`
[06:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:23] T358882: Decommission elastic2037-2054 - https://phabricator.wikimedia.org/T358882
[06:22:56] (PS1) Marostegui: installserver: Do not reimage es2036 [puppet] - https://gerrit.wikimedia.org/r/1013438
[06:24:15] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:24:22] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:27:11] (CR) Marostegui: [C:+2] installserver: Do not reimage es2036 [puppet] - https://gerrit.wikimedia.org/r/1013438 (owner: Marostegui)
[06:33:10] !log T358882 Also updated cross-cluster seeds for ports `9243` and `9443`. Everything should be as expected now.
[06:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:34] T358882: Decommission elastic2037-2054 - https://phabricator.wikimedia.org/T358882
[06:37:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:37:32] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:45:23] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:45:29] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240322T0700)
[07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:15:39] (PS1) Slyngshede: Let Bitu be a little more aggressive with loading SSH keys from LDAP. [software/bitu] - https://gerrit.wikimedia.org/r/1013442 (https://phabricator.wikimedia.org/T360634)
[07:17:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:39:39] (CR) Slyngshede: [C:+2] Inform users that their email address needs to be unique. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1009757 (owner: Slyngshede)
[07:42:48] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:42:55] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:50:01] (CR) Matthias Mullie: [C:+1] Removing MachineVision events, extension is being sunsetted [mediawiki-config] - https://gerrit.wikimedia.org/r/1013101 (https://phabricator.wikimedia.org/T347970) (owner: Cparle)
[07:52:19] (CR) Slyngshede: [C:+2] R:idp enable new Bookworm hosts. [puppet] - https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[07:54:15] !log Enable Bookworm IDP/CAS/SSO servers
[07:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:51] (CR) Muehlenhoff: [C:+1] "LGTM, the additional load should be negligible." [software/bitu] - https://gerrit.wikimedia.org/r/1013442 (https://phabricator.wikimedia.org/T360634) (owner: Slyngshede)
[07:55:59] (PS1) Slyngshede: P:idp Fix spelling in host name [puppet] - https://gerrit.wikimedia.org/r/1013496
[07:57:24] (CR) Slyngshede: [C:+2] P:idp Fix spelling in host name [puppet] - https://gerrit.wikimedia.org/r/1013496 (owner: Slyngshede)
[08:02:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:02:08] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:03:50] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[08:06:12] (PS2) Slyngshede: Let Bitu be a little more aggressive with loading SSH keys from LDAP. [software/bitu] - https://gerrit.wikimedia.org/r/1013442 (https://phabricator.wikimedia.org/T360634)
[08:06:25] (CR) Slyngshede: Let Bitu be a little more aggressive with loading SSH keys from LDAP. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1013442 (https://phabricator.wikimedia.org/T360634) (owner: Slyngshede)
[08:06:58] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:07:05] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:08:20] (CR) Slyngshede: [C:+2] Let Bitu be a little more aggressive with loading SSH keys from LDAP. [software/bitu] - https://gerrit.wikimedia.org/r/1013442 (https://phabricator.wikimedia.org/T360634) (owner: Slyngshede)
[08:09:33] (Merged) jenkins-bot: Let Bitu be a little more aggressive with loading SSH keys from LDAP. [software/bitu] - https://gerrit.wikimedia.org/r/1013442 (https://phabricator.wikimedia.org/T360634) (owner: Slyngshede)
[08:10:41] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:10:48] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:10:58] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[08:12:18] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1013421 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[08:13:04] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1013416 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[08:14:50] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[08:16:52] (CR) Gmodena: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1013341 (https://phabricator.wikimedia.org/T360642) (owner: Fabfur)
[08:17:53] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9653103 (MoritzMuehlenhoff)
[08:18:27] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[08:18:33] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[08:22:41] (SystemdUnitFailed) firing: (2) wmf_auto_restart_nginx.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:05] (CR) Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737859 (owner: Thiemo Kreuz (WMDE))
[08:37:22] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1013339 (https://phabricator.wikimedia.org/T358559) (owner: EoghanGaffney)
[08:41:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:44:21] SRE, ChangeProp, collaboration-services, GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9653141 (Jelto)
[08:44:41] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9653145 (Gehel)
[08:46:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:46:32] sre-alert-triage, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496#9653157 (Gehel)
[08:46:48] SRE, Data-Engineering, Dumps-Generation, Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9653160 (Gehel)
[08:48:17] (PS1) JMeybohm: Increase namespace quota limit for thumbor [deployment-charts] - https://gerrit.wikimedia.org/r/1013499
[08:49:13] SRE, SRE-Access-Requests, Data-Platform-SRE (2024.03.25 - 2024.04.14): Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9653214 (Gehel)
[08:50:18] sre-alert-triage, Data-Platform-SRE (2024.03.25 - 2024.04.14): Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#9653224 (Gehel)
[08:51:49] (CR) Giuseppe Lavagetto: [C:+1] Increase namespace quota limit for thumbor [deployment-charts] - https://gerrit.wikimedia.org/r/1013499 (owner: JMeybohm)
[08:52:31] (CR) JMeybohm: [C:+2] Increase namespace quota limit for thumbor [deployment-charts] - https://gerrit.wikimedia.org/r/1013499 (owner: JMeybohm)
[08:54:04] (Merged) jenkins-bot: Increase namespace quota limit for thumbor [deployment-charts] - https://gerrit.wikimedia.org/r/1013499 (owner: JMeybohm)
[08:54:05] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#9653266 (Gehel)
[08:54:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:56:17] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:57:38] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:57:50] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:58:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:58:10] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:58:12] (CR) Fabfur: "This could be abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013341" [puppet] - https://gerrit.wikimedia.org/r/1013275 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[08:58:28] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:58:57] (PS1) Brouberol: Decommission aqs records [dns] - https://gerrit.wikimedia.org/r/1013500 (https://phabricator.wikimedia.org/T358793)
[08:59:22] SRE, Infrastructure-Foundations, Data-Platform-SRE (2024.03.25 - 2024.04.14): Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412#9653300 (Gehel)
[08:59:44] SRE, Data-Platform-SRE (2024.03.25 - 2024.04.14): Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9653296 (Gehel)
[08:59:56] SRE, Data-Engineering, Data-Platform-SRE, serviceops, Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#9653302 (Gehel)
[09:02:01] (CR) Fabfur: [C:+2] benthos/haproxy: delete some fields that aren't in curr webrequest [puppet] - https://gerrit.wikimedia.org/r/1013341 (https://phabricator.wikimedia.org/T360642) (owner: Fabfur)
[09:05:51] (PS1) Brouberol: Decommission aqs realserver pool [puppet] - https://gerrit.wikimedia.org/r/1013501 (https://phabricator.wikimedia.org/T358793)
[09:06:14] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[09:07:54] (Abandoned) Fabfur: benthos: allow truncated http protocol version [puppet] - https://gerrit.wikimedia.org/r/1013275 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[09:12:38] (CR) Jelto: [C:+2] gitlab: temporary allow dockerfile frontend on Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/1013049 (https://phabricator.wikimedia.org/T357612) (owner: Jelto)
[09:14:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:14:39] (CR) Brouberol: [C:+2] superset-next: upgrade to 3.1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1013236 (https://phabricator.wikimedia.org/T358674) (owner: Brouberol)
[09:19:34] (Abandoned) Fabfur: benthos: added $schema key to unit tests [puppet] - https://gerrit.wikimedia.org/r/1013278 (https://phabricator.wikimedia.org/T360450) (owner: Fabfur)
[09:19:38] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[09:20:37] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[09:23:20] (CR) JMeybohm: [C:+1] global_config: rework external services data structure [puppet] - https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: Brouberol)
[09:25:39] (PS1) Fabfur: benthos: add $schema key to unit tests [puppet] - https://gerrit.wikimedia.org/r/1013503 (https://phabricator.wikimedia.org/T360450)
[09:25:45] (CR) Cparle: MachineVision extension is being sunsetted, so stop doing dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) (owner: Cparle)
[09:25:57] (CR) Muehlenhoff: "Looks good, but can only be merged when the aqs LVS config has been removed, otherwise this would page." [dns] - https://gerrit.wikimedia.org/r/1013500 (https://phabricator.wikimedia.org/T358793) (owner: Brouberol)
[09:26:29] (CR) Fabfur: "Abandoned for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013503" [puppet] - https://gerrit.wikimedia.org/r/1013278 (https://phabricator.wikimedia.org/T360450) (owner: Fabfur)
[09:28:55] (PS2) Fabfur: benthos: add $schema key to unit tests [puppet] - https://gerrit.wikimedia.org/r/1013503 (https://phabricator.wikimedia.org/T360450)
[09:29:34] (PS3) Majavah: ldap: Pass typed data to sssd class [puppet] - https://gerrit.wikimedia.org/r/1013324
[09:30:31] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts apt2001.wikimedia.org
[09:30:56] (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1685/console" [puppet] - https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[09:32:14] (PS1) Jelto: Revert "gitlab: temporary allow dockerfile frontend on Trusted Runners" [puppet] - https://gerrit.wikimedia.org/r/1013261 (https://phabricator.wikimedia.org/T357612)
[09:33:38] (PS3) Fabfur: benthos: add $schema key to unit tests [puppet] - https://gerrit.wikimedia.org/r/1013503 (https://phabricator.wikimedia.org/T360450)
[09:34:13] (CR) Majavah: [V:+1 C:+2] haproxy: cloud: use package{} to install haproxy [puppet] - https://gerrit.wikimedia.org/r/1013308 (https://phabricator.wikimedia.org/T360630) (owner: Majavah)
[09:34:20] (CR) Majavah: [C:+2] P:metricsinfra: haproxy: do not set httplog on backends [puppet] - https://gerrit.wikimedia.org/r/1013309 (owner: Majavah)
[09:34:25] (CR) Jelto: [C:+2] Revert "gitlab: temporary allow dockerfile frontend on Trusted Runners" [puppet] - https://gerrit.wikimedia.org/r/1013261 (https://phabricator.wikimedia.org/T357612) (owner: Jelto)
[09:35:26] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:36:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:36:10] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:37:13] !log jnuche@deploy1002 Installing scap version "4.73.1" for 371 hosts
[09:37:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: apt2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:38:07] !log jnuche@deploy1002 Installation of scap version "4.73.1" completed for 371 hosts
[09:39:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: apt2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:39:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:39:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts apt2001.wikimedia.org
[09:39:50] SRE, Infrastructure-Foundations: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613#9653391 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `apt2001.wikimedia.org` - apt2001.wikimedia.org (**PASS**) - Downtimed host on Icinga/Alertm...
[09:40:07] (CR) Majavah: [C:+2] P:wmcs::metricsinfra: haproxy: use http-request replace-path [puppet] - https://gerrit.wikimedia.org/r/1013310 (https://phabricator.wikimedia.org/T360630) (owner: Majavah)
[09:42:42] (CR) Brouberol: [C:+2] admin-ng: Define external services namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: Brouberol)
[09:45:51] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:45:58] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:47:54] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:48:14] (PS1) Slyngshede: Keymanagement: Bypass job queue for ssh key operations. [software/bitu] - https://gerrit.wikimedia.org/r/1013507 (https://phabricator.wikimedia.org/T360634)
[09:49:00] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts apt1001.wikimedia.org
[09:49:20] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:52:11] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:52:18] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:52:43] (PS1) Muehlenhoff: Remove puppet references to apt1001/2001 [puppet] - https://gerrit.wikimedia.org/r/1013509 (https://phabricator.wikimedia.org/T331613)
[09:53:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:56:51] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:56:57] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:00:50] (PS1) Fabfur: benthos: drop messages containing specific BADREQ pattern [puppet] - https://gerrit.wikimedia.org/r/1013510 (https://phabricator.wikimedia.org/T358109)
[10:01:17] (CR) Fabfur: [C:+2] benthos: add $schema key to unit tests [puppet] - https://gerrit.wikimedia.org/r/1013503 (https://phabricator.wikimedia.org/T360450) (owner: Fabfur)
[10:02:51] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: apt1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:03:21] (PS1) Muehlenhoff: Remove obsolete dummy key tabs [labs/private] - https://gerrit.wikimedia.org/r/1013511 (https://phabricator.wikimedia.org/T331613)
[10:04:51] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:04:58] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:06:15] (CR) Fabfur: [C:+2] benthos: drop messages containing specific BADREQ pattern [puppet] - https://gerrit.wikimedia.org/r/1013510 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[10:06:31] (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete dummy key tabs [labs/private] - https://gerrit.wikimedia.org/r/1013511 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[10:06:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: apt1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:06:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:06:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts apt1001.wikimedia.org
[10:07:01] (CR) Slyngshede: Remove puppet references to apt1001/2001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013509 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[10:07:01] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613#9653434 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `apt1001.wikimedia.org` - apt1001.wikimedia.org (**PASS**) - Downtimed...
[10:09:02] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:09:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:13:13] (CR) Muehlenhoff: Remove puppet references to apt1001/2001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013509 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[10:14:20] (CR) Slyngshede: Remove puppet references to apt1001/2001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013509 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[10:14:33] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1013509 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[10:16:21] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:16:32] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:17:36] !log uploaded jenkins 2.440.2 to apt.wikimedia.org T360759
[10:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:40] T360759: Jenkins core security advisory - 2024-03-20 - https://phabricator.wikimedia.org/T360759
[10:17:41] ^ hashar
[10:18:45] (CR) Muehlenhoff: [C:+2] Remove puppet references to apt1001/2001 [puppet] - https://gerrit.wikimedia.org/r/1013509 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[10:19:11] (CR) Brouberol: [C:+2] external-services: define a chart referencing external services clusters [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[10:19:58] (CR) Klausman: [C:+2] ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - https://gerrit.wikimedia.org/r/1013321 (owner: Klausman)
[10:21:04] (Merged) jenkins-bot: ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - https://gerrit.wikimedia.org/r/1013321 (owner: Klausman)
[10:22:56] (CR) Tchanders: "Done: Ic9564486c5aee68b591caf4a4bc9d6e08826be1f, I668d9e34819af02e1c444787b97bd59c5b516316" [puppet] - https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: Tchanders)
[10:23:07] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[10:23:09] (03PS1) 10Brouberol: external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) [10:26:36] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9653483 (10dcaro) [10:28:09] 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9653490 (10Manuel) 05Open→03Stalled [10:29:01] 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, and 2 others: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9653492 (10Clement_Goubert) That makes sense. I don't necessarily have a problem with it not using the service mesh (except for the lack of telemetry), ex... [10:29:13] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:29:20] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:30:41] (03PS1) 10Filippo Giunchedi: prometheus: scrape envoy on k8s metrics with 'usedonly' (take #2) [puppet] - 10https://gerrit.wikimedia.org/r/1013515 (https://phabricator.wikimedia.org/T359633) [10:34:42] !log btullis@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1168.eqiad.wmnet [10:34:55] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1168.eqiad.wmnet [10:35:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1013507 (https://phabricator.wikimedia.org/T360634) (owner: 10Slyngshede) [10:36:22] !log btullis@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1168.eqiad.wmnet [10:36:35] !log btullis@cumin1002 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1168.eqiad.wmnet [10:38:46] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1168.eqiad.wmnet with reason: Investigating disk errors [10:39:00] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1168.eqiad.wmnet with reason: Investigating disk errors [10:40:36] (03CR) 10Filippo Giunchedi: "I'll deploy early next week" [puppet] - 10https://gerrit.wikimedia.org/r/1013515 (https://phabricator.wikimedia.org/T359633) (owner: 10Filippo Giunchedi) [10:46:17] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9653542 (10dcaro) TSR and performance tests sent to DELL, bringing all the hosts back online. [10:47:02] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:47:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:47:32] (03CR) 10Filippo Giunchedi: [C:04-1] "See inline for the current solution comments" [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [10:51:22] (03PS2) 10Brouberol: external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) [10:54:13] (03PS3) 10Brouberol: external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) [10:54:34] (03PS1) 10Klausman: ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013518 [10:54:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:55:01] (03CR) 10Klausman: [C:03+2] ml-services: fix 
discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013518 (owner: 10Klausman) [10:55:02] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:55:57] (03Merged) 10jenkins-bot: ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013518 (owner: 10Klausman) [10:56:32] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [10:56:44] (03CR) 10Clément Goubert: [C:03+1] prometheus: scrape envoy on k8s metrics with 'usedonly' (take #2) [puppet] - 10https://gerrit.wikimedia.org/r/1013515 (https://phabricator.wikimedia.org/T359633) (owner: 10Filippo Giunchedi) [10:57:20] (03PS1) 10Cparle: MachineVision extension is sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) [10:59:44] !log btullis@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1168.eqiad.wmnet [10:59:57] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1168.eqiad.wmnet [11:01:42] (03PS4) 10Brouberol: external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) [11:05:48] (03PS1) 10Muehlenhoff: cloudceph::osd: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013521 [11:05:51] (03CR) 10JMeybohm: external-services: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [11:06:43] (03PS5) 10Brouberol: external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) [11:06:51] (03CR) 10Brouberol: external-services: define helmfile (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [11:07:11] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#9653584 (10BTullis) I've just had a failure to update firmware for a host and a brief search led me to this issue. The error I got was from an-wo... [11:07:19] (03CR) 10Klausman: [C:03+1] profile::prometheus::k8s: move istio metrics to a separate job [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [11:07:41] (03CR) 10Klausman: [C:03+1] Add the amd-pytorch base image for ML workloads [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [11:12:28] (03PS1) 10Majavah: P:toolforge::legacy_redirector: drop configuration [puppet] - 10https://gerrit.wikimedia.org/r/1013522 (https://phabricator.wikimedia.org/T311909) [11:12:29] (03PS1) 10Majavah: P:toolforge::legacy_redirector: Use Apache on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1013523 (https://phabricator.wikimedia.org/T311909) [11:13:24] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1686/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013523 (https://phabricator.wikimedia.org/T311909) (owner: 10Majavah) [11:15:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:17:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:50] (03CR) 10Slyngshede: [C:03+2] Keymanagement: Bypass job queue for ssh key operations. [software/bitu] - 10https://gerrit.wikimedia.org/r/1013507 (https://phabricator.wikimedia.org/T360634) (owner: 10Slyngshede) [11:18:56] (03Merged) 10jenkins-bot: Keymanagement: Bypass job queue for ssh key operations. [software/bitu] - 10https://gerrit.wikimedia.org/r/1013507 (https://phabricator.wikimedia.org/T360634) (owner: 10Slyngshede) [11:19:51] (03CR) 10Muehlenhoff: [C:03+1] "Ship it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [11:20:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:21:54] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Move 70% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T360763 (10Clement_Goubert) 03NEW [11:22:11] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9653618 (10Clement_Goubert) [11:23:35] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Move 70% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T360763#9653616 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High [11:24:09] (03PS2) 10Majavah: P:toolforge::legacy_redirector: Use Apache on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1013523 (https://phabricator.wikimedia.org/T311909) [11:25:00] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Move 70% of mediawiki external requests to mw on k8s - 
https://phabricator.wikimedia.org/T360763#9653621 (10Clement_Goubert) Waiting on `codfw` repool as part of {T357547} before moving forward with this increase. [11:39:40] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:39:47] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:52:32] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:52:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:55:04] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:55:23] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:03:15] !log reedy@deploy1002 Synchronized php-1.42.0-wmf.23/includes/htmlform/fields/HTMLHiddenField.php: T360717 (duration: 13m 06s) [12:03:19] T360717: HTMLForm hidden fields gone -- CAPTCHA failure rate at 100% - https://phabricator.wikimedia.org/T360717 [12:17:06] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:17:13] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:35:50] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:35:58] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:44:27] !log eoghan@cumin1002 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [12:51:38] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9653628 (10Milimetric) Approved! 
[12:51:45] (03CR) 10Reedy: [C:03+2] HTMLHiddenField: Support CodexHTMLForm [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013258 (https://phabricator.wikimedia.org/T360717) (owner: 10Reedy) [12:51:53] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9653629 (10Clement_Goubert) [12:52:05] (03PS1) 10David Caro: ceph: fix location hook path [puppet] - 10https://gerrit.wikimedia.org/r/1013524 (https://phabricator.wikimedia.org/T297083) [12:52:18] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:52:21] (03PS1) 10David Caro: ceph.eqiad: enable location hook [puppet] - 10https://gerrit.wikimedia.org/r/1013525 (https://phabricator.wikimedia.org/T297083) [12:52:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013521 (owner: 10Muehlenhoff) [12:53:09] (03PS1) 10Fabfur: benthos: enable benthos instance on upload host (cp4045) [puppet] - 10https://gerrit.wikimedia.org/r/1013526 (https://phabricator.wikimedia.org/T358109) [12:53:58] (03PS2) 10Majavah: P:toolforge::legacy_redirector: Drop configuration [puppet] - 10https://gerrit.wikimedia.org/r/1013522 (https://phabricator.wikimedia.org/T311909) [12:54:06] (03PS3) 10Majavah: P:toolforge::legacy_redirector: Use Apache on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1013523 (https://phabricator.wikimedia.org/T311909) [12:55:27] (03CR) 10David Caro: [C:03+2] ceph: fix location hook path [puppet] - 10https://gerrit.wikimedia.org/r/1013524 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [12:55:35] (03CR) 10David Caro: [C:03+2] ceph.eqiad: enable location hook [puppet] - 10https://gerrit.wikimedia.org/r/1013525 
(https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [12:55:55] (03Merged) 10jenkins-bot: HTMLHiddenField: Support CodexHTMLForm [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013258 (https://phabricator.wikimedia.org/T360717) (owner: 10Reedy) [12:56:19] (03PS6) 10JMeybohm: external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:56:27] (03PS1) 10JMeybohm: Add external_services_definitions to fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013527 (https://phabricator.wikimedia.org/T331894) [12:56:59] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9653700 (10Clement_Goubert) [12:57:44] 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, and 2 others: 14Alter changeprop chart to use the service mesh - 14https://phabricator.wikimedia.org/T360625#9653698 (10Clement_Goubert) 05Open→03Declined 14Abandoned because the internals of changeprop make it unadvisable to add another layer. I'll... 
[12:57:55] (03CR) 10JMeybohm: [C:03+1] Add the amd-pytorch base image for ML workloads [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [12:58:32] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767 (10Clement_Goubert) 03NEW [12:58:44] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9653739 (10Clement_Goubert) p:05Triage→03High [12:59:12] (03CR) 10JMeybohm: [C:03+1] external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:59:20] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9653741 (10Clement_Goubert) [12:59:24] (03PS2) 10JMeybohm: Add external_services_definitions to fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013527 (https://phabricator.wikimedia.org/T331894) [12:59:28] (03PS7) 10JMeybohm: external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:59:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:00:00] (03PS1) 10Btullis: Add fabfur to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1013529 (https://phabricator.wikimedia.org/T359561) [13:00:08] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9653763 (10Clement_Goubert) [13:00:16] (03CR) 10Brouberol: [C:03+1] "Nicely done" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1013527 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [13:00:40] (03CR) 10JMeybohm: [C:03+2] external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:00:45] (03CR) 10JMeybohm: [C:03+2] Add external_services_definitions to fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013527 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [13:01:17] (03PS1) 10Clément Goubert: changeprop: Move staging to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013532 (https://phabricator.wikimedia.org/T360767) [13:01:20] (03PS1) 10Clément Goubert: changeprop: Move production to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013533 (https://phabricator.wikimedia.org/T360767) [13:01:28] (03CR) 10David Caro: [C:03+1] P:toolforge::checker: do not hardcode list of etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1013095 (https://phabricator.wikimedia.org/T279078) (owner: 10Majavah) [13:01:44] (03Merged) 10jenkins-bot: Add external_services_definitions to fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013527 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [13:01:53] (03Merged) 10jenkins-bot: external-services: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013512 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:03:05] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Move 70% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T360763#9653845 (10Clement_Goubert) Given we have increased `mw-web` and `mw-api-ext` by respectively 53 and 10 replicas to cope with ha... 
[13:03:45] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::checker: do not hardcode list of etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1013095 (https://phabricator.wikimedia.org/T279078) (owner: 10Majavah) [13:04:25] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9653725 (10BTullis) a:05Fabfur→03BTullis [13:04:53] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9653751 (10BTullis) [13:05:35] 06SRE, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9653766 (10Ladsgroup) >>! In T360029#9635703, @Ladsgroup wrote: > It might sound revolutionary but I think mediawiki should not re-implement DNS. All of the... [13:06:31] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:06:46] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[13:07:17] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:07:23] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:09:38] (03PS1) 10Clément Goubert: kubernetes: move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1013536 (https://phabricator.wikimedia.org/T351074) [13:11:45] (03PS1) 10Fabfur: admin: added zoe account [puppet] - 10https://gerrit.wikimedia.org/r/1013537 (https://phabricator.wikimedia.org/T360639) [13:13:45] 06SRE, 06Infrastructure-Foundations, 10netops: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772 (10cmooney) 03NEW p:05Triage→03Low [13:13:57] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [13:14:17] 06SRE, 06Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869#9653918 (10cmooney) [13:14:21] 06SRE, 06Infrastructure-Foundations, 10netops: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772#9653917 (10cmooney) [13:15:29] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Infrastructure-Foundations: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802#9653901 (10BTullis) [13:16:17] (03CR) 10Alexandros Kosiaris: [C:03+1] changeprop: Move staging to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013532 (https://phabricator.wikimedia.org/T360767) (owner: 10Clément Goubert) [13:16:35] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Infrastructure-Foundations: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802#9653931 (10BTullis) I'm planning to carry out 
{T358196} shortly, which I believe may have a beneficial impact on this ticket. I'll not merge them... [13:16:55] (03CR) 10Alexandros Kosiaris: [C:03+1] changeprop: Move production to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013533 (https://phabricator.wikimedia.org/T360767) (owner: 10Clément Goubert) [13:17:33] 06SRE, 06Infrastructure-Foundations, 10netops: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772#9653941 (10cmooney) [13:17:48] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:17:50] !log `elukey@cumin1002:~$ sudo cumin 'stat100[4,5,8,9]*' 'kill `pgrep -u kcv-wikimf`'` to unblock puppet on various stat nodes [13:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:53] (03CR) 10Clément Goubert: [C:03+1] admin: added zoe account [puppet] - 10https://gerrit.wikimedia.org/r/1013537 (https://phabricator.wikimedia.org/T360639) (owner: 10Fabfur) [13:17:55] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:18:01] (03CR) 10Alexandros Kosiaris: profile::prometheus::k8s: move istio metrics to a separate job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:18:35] (03CR) 10Fabfur: [C:03+2] admin: added zoe account [puppet] - 10https://gerrit.wikimedia.org/r/1013537 (https://phabricator.wikimedia.org/T360639) (owner: 10Fabfur) [13:19:00] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Katie Coleman - https://phabricator.wikimedia.org/T360367#9653961 (10Fabfur) Hello, thanks for this request, could your direct manager please confirm this (it's sufficient to respond to this ticket). Thanks! 
[13:19:07] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Katie Coleman - https://phabricator.wikimedia.org/T360367#9653962 (10Fabfur) a:03Fabfur [13:19:16] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/WMF for zoe - https://phabricator.wikimedia.org/T360639#9653963 (10Fabfur) a:03Fabfur [13:19:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:20:04] (03PS1) 10Brouberol: global_config: fix druid historical port [puppet] - 10https://gerrit.wikimedia.org/r/1013539 (https://phabricator.wikimedia.org/T331894) [13:20:20] (03PS2) 10Brouberol: global_config: fix druid historical port [puppet] - 10https://gerrit.wikimedia.org/r/1013539 (https://phabricator.wikimedia.org/T331894) [13:20:28] (03CR) 10JMeybohm: [C:03+1] global_config: fix druid historical port [puppet] - 10https://gerrit.wikimedia.org/r/1013539 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:23:34] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:23:37] (03CR) 10Brouberol: [C:03+2] global_config: fix druid historical port [puppet] - 10https://gerrit.wikimedia.org/r/1013539 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:23:41] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:25:03] (03CR) 10Ssingh: [C:03+1] "Looks good thanks! For our use case of the DNS hosts, manage_resolvconf set to false should cover disabling systemd-timesyncd on the DNS h" [puppet] - 10https://gerrit.wikimedia.org/r/1013382 (owner: 10Andrew Bogott) [13:28:24] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:28:42] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[13:29:34] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:29:41] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:31:45] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:31:56] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:32:51] (03PS1) 10David Caro: ceph: move the location hook to the osd top level [puppet] - 10https://gerrit.wikimedia.org/r/1013540 (https://phabricator.wikimedia.org/T297083) [13:33:08] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776 (10cmooney) 03NEW p:05Triage→03Medium [13:33:26] 06SRE, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412#9653991 (10BTullis) a:03BTullis [13:33:31] (03PS1) 10Elukey: role::docker_registry_ha::registry: increase tmpfs size in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1013541 (https://phabricator.wikimedia.org/T360637) [13:33:50] (03CR) 10Brouberol: [C:03+2] Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:34:41] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9654011 (10Papaul) @cmooney what works for you works for me as well [13:35:12] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9654012 (10Papaul) [13:35:33] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9654013 (10Papaul) [13:35:40] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 
2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013541 (https://phabricator.wikimedia.org/T360637) (owner: 10Elukey) [13:36:23] (03PS2) 10David Caro: ceph: move the location hook to the osd top level [puppet] - 10https://gerrit.wikimedia.org/r/1013540 (https://phabricator.wikimedia.org/T297083) [13:40:00] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1688/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013540 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [13:40:36] (03CR) 10David Caro: [V:03+1 C:03+2] ceph: move the location hook to the osd top level [puppet] - 10https://gerrit.wikimedia.org/r/1013540 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [13:41:53] (03PS1) 10Muehlenhoff: cloudceph::mon: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013542 [13:42:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013542 (owner: 10Muehlenhoff) [13:47:19] (03CR) 10Btullis: [C:03+1] Decommission aqs realserver pool [puppet] - 10https://gerrit.wikimedia.org/r/1013501 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [13:47:27] (03PS1) 10Cathal Mooney: Remove config for ESI-LAG between codfw spines facing asw-b-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1013545 (https://phabricator.wikimedia.org/T360776) [13:47:44] (03PS1) 10Ssingh: P:dns::auth: remove obsolete authdns-related files [puppet] - 10https://gerrit.wikimedia.org/r/1013546 [13:48:56] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1689/console" [puppet] - 10https://gerrit.wikimedia.org/r/1013546 (owner: 10Ssingh) [13:49:46] (03CR) 10Cathal Mooney: 
[C:03+2] Remove config for ESI-LAG between codfw spines facing asw-b-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1013545 (https://phabricator.wikimedia.org/T360776) (owner: 10Cathal Mooney) [13:50:25] (03Merged) 10jenkins-bot: Remove config for ESI-LAG between codfw spines facing asw-b-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1013545 (https://phabricator.wikimedia.org/T360776) (owner: 10Cathal Mooney) [13:52:10] 06SRE, 10Maps: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778 (10MoritzMuehlenhoff) 03NEW [13:52:38] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth: remove obsolete authdns-related files [puppet] - 10https://gerrit.wikimedia.org/r/1013546 (owner: 10Ssingh) [13:53:58] (03PS5) 10Elukey: profile::prometheus::k8s: move istio metrics to a separate job [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) [13:54:29] (03PS1) 10Brouberol: Revert "superset-next: upgrade to 3.1.1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013262 [13:55:07] (03CR) 10Elukey: profile::prometheus::k8s: move istio metrics to a separate job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:55:38] (03PS5) 10Brouberol: Superset: migrate external services egress to Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009290 (https://phabricator.wikimedia.org/T359411) [13:56:30] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9654068 (10MoritzMuehlenhoff) [13:57:15] (03CR) 10Brouberol: [C:03+2] Revert "superset-next: upgrade to 3.1.1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013262 (owner: 10Brouberol) [14:07:15] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Katie Coleman - https://phabricator.wikimedia.org/T360367#9654082 
(10KColeman-WMF) @RHo Please can you confirm that I'm permitted to access Superset and Turnilo? Thanks! [14:07:35] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on asw-b-codfw with reason: prepping to decom switch stack [14:07:49] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on asw-b-codfw with reason: prepping to decom switch stack [14:07:59] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9654083 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=79f10d11-133e-477b-be4d-b326d7e4bcf9) set by cmooney@cumin1002 for 4:00:00... [14:11:53] !log disabling LAG from asw-b-codfw to ssw-aX-codfw T360776 [14:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:58] T360776: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776 [14:16:10] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Katie Coleman - https://phabricator.wikimedia.org/T360367#9654090 (10RHo) Hi @Fabfur - Confirming @KColeman-WMF should have access to Superset and Turnilo – and Grafana too. Thank you! :) [14:16:23] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#9654089 (10brouberol) Starting today (at least for the `staging-codfw` and `dse-k8s-eqiad` clusters), apps running in Kubernetes we can use... [14:18:18] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9654094 (10cmooney) [14:19:38] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Katie Coleman - https://phabricator.wikimedia.org/T360367#9654095 (10Fabfur) Thanks, I'll notice you soon with the confirmation! 
[14:20:54] !log restarting Cassandra decommission of restbase1024-{b,c} — T360548 [14:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548 [14:23:37] (03PS1) 10Fabfur: admin: added kcoleman account [puppet] - 10https://gerrit.wikimedia.org/r/1013549 (https://phabricator.wikimedia.org/T360367) [14:25:24] (03CR) 10Ssingh: [C:03+1] admin: added kcoleman account [puppet] - 10https://gerrit.wikimedia.org/r/1013549 (https://phabricator.wikimedia.org/T360367) (owner: 10Fabfur) [14:26:54] (03PS7) 10Klausman: admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) [14:27:03] (03CR) 10Fabfur: [C:03+2] admin: added kcoleman account [puppet] - 10https://gerrit.wikimedia.org/r/1013549 (https://phabricator.wikimedia.org/T360367) (owner: 10Fabfur) [14:33:00] (03PS25) 10Arnaudb: mysqld-exporter-config: simplify manual runs [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) [14:35:14] !log eoghan@cumin1002 END (ERROR) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [14:35:26] (03CR) 10Arnaudb: "I've tested a bunch of cases, it should land on its feet or fail as gracefully as possible:" [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb) [14:35:27] !log eoghan@cumin1002 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [14:35:43] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [14:36:38] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmf for 
Katie Coleman - https://phabricator.wikimedia.org/T360367#9654137 (10Fabfur) User added to the `ldap/wmf` group [14:37:11] !log eoghan@cumin1002 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [14:37:18] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:25] (03PS26) 10Arnaudb: mysqld-exporter-config: simplify manual runs [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) [14:37:42] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [14:37:58] jouncebot nowandnext [14:37:58] For the next 16 hour(s) and 22 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240322T0700) [14:37:58] In 16 hour(s) and 22 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240323T0700) [14:38:12] I'm rolling the train to group2 again. 
[14:39:17] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013555 (https://phabricator.wikimedia.org/T354441) [14:39:18] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013555 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [14:39:49] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: 14Grant Access to ldap/wmf for Katie Coleman - 14https://phabricator.wikimedia.org/T360367#9654143 (10Fabfur) 05Open→03Resolved [14:40:02] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013555 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [14:40:12] !log eoghan@cumin1002 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [14:40:33] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: 14Grant Access to ldap/WMF for zoe - 14https://phabricator.wikimedia.org/T360639#9654141 (10Fabfur) 05Open→03Resolved 14User added to the `ldap/wmf` group, please let me know if you can use these services. [14:42:38] (03PS1) 10Cathal Mooney: Remove entries for asw-b-codfw switch stack [puppet] - 10https://gerrit.wikimedia.org/r/1013557 (https://phabricator.wikimedia.org/T360776) [14:43:55] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9654150 (10Jhancock.wm) @Andrew thanks for the update. Can I bug you to update the site.pp as well? Thanks! 
[14:45:34] (03CR) 10Brouberol: [C:03+1] Add fabfur to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1013529 (https://phabricator.wikimedia.org/T359561) (owner: 10Btullis) [14:47:18] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:16] (03PS1) 10Andrew Bogott: site.pp: add insetup entries for new cloudbackup200[34] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1013559 (https://phabricator.wikimedia.org/T356216) [14:49:30] (03CR) 10Elukey: "Left some comments, but I am wondering if we really need to set a virtual service for this use case, since we don't need much routing/prox" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [14:52:08] (03CR) 10Andrew Bogott: [C:03+2] site.pp: add insetup entries for new cloudbackup200[34] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1013559 (https://phabricator.wikimedia.org/T356216) (owner: 10Andrew Bogott) [14:52:44] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Remove asw-b-codfw from synced hiera data - cmooney@cumin1002 - T360776" [14:52:49] T360776: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776 [14:57:00] 06SRE, 10SRE-swift-storage, 10Thumbor, 06Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9654167 (10MatthewVernon) One thing that was discussed at the SRE meeting in Warsaw was looking at turnilo data (which IIRC is the last 90 days' requests) to e... 
[15:04:23] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.23 refs T354441 [15:04:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Remove asw-b-codfw from synced hiera data - cmooney@cumin1002 - T360776" [15:04:38] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441 [15:04:44] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Phase out cergen for Fundraising services - https://phabricator.wikimedia.org/T360779 (10Jgreen) 03NEW [15:04:50] T360776: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776 [15:05:57] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:08:06] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9654198 (10Jgreen) [15:13:54] (03PS1) 10Elukey: role::aqs: deploy the PKI-enabled TLS bundle and use it on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) [15:15:16] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:15:22] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:15:52] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:16:24] 06SRE, 10SRE-swift-storage, 10Thumbor, 06Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9654275 (10Ladsgroup) So I looked at some numbers for February: ` ladsgroup@stat1005:~$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --dri... 
[15:17:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:20:35] (03PS8) 10Klausman: admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) [15:20:41] (03CR) 10Klausman: "I can split this patch to only have the ServiceEntry for testing (which I haven't done in staging, I am always a bit paranoid about live-e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:24:38] (03CR) 10Elukey: "Ok sure I didn't want to make you change another time the code review, I didn't think about this use case.. 
If you are not confident for t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:24:38] (03PS2) 10Jon Harald Søby: Remove Nearby extension and Minerva donate button for nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013564 (https://phabricator.wikimedia.org/T360782) [15:25:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:26:05] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:27:49] (03PS2) 10Elukey: role::aqs: deploy the PKI-enabled TLS bundle and use it on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) [15:29:10] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:29:39] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789 (10RobH) 03NEW [15:30:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:30:10] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:30:11] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9654382 (10RobH) [15:34:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:34:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:35:40] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: 14Migrate apt repository to bookworm - 14https://phabricator.wikimedia.org/T331613#9654402 (10MoritzMuehlenhoff) 
05Open→03Resolved 14apt.wikimedia.org is now running on two Bookworm VMs (apt1002 and apt2002), using the new/forked reprepro. The... [15:38:55] (03PS1) 10Fabfur: admin: new key for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1013569 (https://phabricator.wikimedia.org/T352098) [15:40:00] (03PS3) 10Elukey: role::aqs: deploy the PKI-enabled TLS bundle and use it on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) [15:40:00] (03PS1) 10Elukey: cassandra::instance: add the tls_use_pki_keep_old_ca parameter [puppet] - 10https://gerrit.wikimedia.org/r/1013571 (https://phabricator.wikimedia.org/T352647) [15:43:54] (03CR) 10Ssingh: [C:03+1] admin: new key for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1013569 (https://phabricator.wikimedia.org/T352098) (owner: 10Fabfur) [15:51:58] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:52:09] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:52:16] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:53:44] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098#9654462 (10XiaoXiao-WMF) new key `ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC+E7DCVBFUZ7tZrSgEhcCD/sWCI54aujtG0owaoxFEunY5HcNbl9nPjza1S... 
[15:55:21] (03CR) 10Fabfur: [C:03+2] admin: new key for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1013569 (https://phabricator.wikimedia.org/T352098) (owner: 10Fabfur) [15:57:04] (03PS1) 10Ebernhardson: search: Wait for young pool alert to fail for 5 minutes [alerts] - 10https://gerrit.wikimedia.org/r/1013575 [15:59:58] (03CR) 10Papaul: [C:03+1] Remove entries for asw-b-codfw switch stack [puppet] - 10https://gerrit.wikimedia.org/r/1013557 (https://phabricator.wikimedia.org/T360776) (owner: 10Cathal Mooney) [16:00:27] (03CR) 10Cathal Mooney: [C:03+2] Remove entries for asw-b-codfw switch stack [puppet] - 10https://gerrit.wikimedia.org/r/1013557 (https://phabricator.wikimedia.org/T360776) (owner: 10Cathal Mooney) [16:00:35] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old asw-b-codfw entries - cmooney@cumin1002" [16:01:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old asw-b-codfw entries - cmooney@cumin1002" [16:01:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:02:12] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9654489 (10cmooney) [16:06:00] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - 14https://phabricator.wikimedia.org/T352098#9654491 (10Fabfur) 05Open→03Resolved [16:18:51] (03PS1) 10Milimetric: dumps.wikimedia.org/other: point clickstream link to readme [puppet] - 10https://gerrit.wikimedia.org/r/1013576 (https://phabricator.wikimedia.org/T356444) [16:21:30] (03CR) 10Btullis: [C:03+1] "Looks good. Many thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1009290 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [16:22:37] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs100[6-8] - https://phabricator.wikimedia.org/T353845#9654562 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [16:25:33] (03CR) 10Btullis: [C:03+2] dumps.wikimedia.org/other: point clickstream link to readme [puppet] - 10https://gerrit.wikimedia.org/r/1013576 (https://phabricator.wikimedia.org/T356444) (owner: 10Milimetric) [16:26:50] 06SRE, 06Infrastructure-Foundations, 10netops: 14Migrate IP gateway for public1-a-codfw to spine switches - 14https://phabricator.wikimedia.org/T351532#9654574 (10cmooney) 05Open→03Resolved [16:27:28] 06SRE, 06Infrastructure-Foundations, 10netops: 14Migrate IP gateway for private1-b-codfw to spine switches - 14https://phabricator.wikimedia.org/T351534#9654580 (10cmooney) 05Open→03Resolved [16:28:35] 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938#9654586 (10cmooney) [16:28:47] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: 14Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - 14https://phabricator.wikimedia.org/T347191#9654584 (10cmooney) 05Open→03Resolved 14Closing this task, everything now completed. For future rows we can b... 
[16:28:56] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: 14Upgrade new codfw switches to Juniper recommended - 14https://phabricator.wikimedia.org/T341670#9654588 (10cmooney) [16:29:07] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485#9654587 (10cmooney) [16:29:32] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:32:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudbackup2003 to codfw - jhancock@cumin2002" [16:32:52] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [16:32:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudbackup2003 to codfw - jhancock@cumin2002" [16:32:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:35:54] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:36:02] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:36:14] !log cmooney@cumin1002 START - Cookbook sre.hosts.decommission for hosts sretest2003.codfw.wmnet [16:38:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudbackup2003.mgmt.codfw.wmnet with reboot policy FORCED [16:40:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 26.46% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:40:30] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Connect two hosts in 
codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9654614 (10cmooney) >>! In T345803#9479281, @Papaul wrote: > @cmooney can we get those 2 hosts back in decom? Thanks @papaul I'm done wit... [16:41:15] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:41:52] 06SRE, 06Infrastructure-Foundations, 10netops: 14Codfw row A/B top-of-rack switch refresh - 14https://phabricator.wikimedia.org/T327938#9654617 (10cmooney) 05Open→03Resolved a:03cmooney 14Closing this one, I've made some notes on wikitech below about how to approach these for future rows. https:/... [16:43:22] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9654622 (10cmooney) [16:45:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 26.46% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:51:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:52:16] (03CR) 10Dzahn: [C:03+2] doc: switch envoy ssl cert provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013421 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [16:52:31] (03PS2) 10Dzahn: doc: switch envoy ssl cert provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013421 (https://phabricator.wikimedia.org/T360413) [16:54:54] !log @deploy2002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [16:55:01] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:56:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:57:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:57:08] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:01:43] (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Switch gitlab-replica from gitlab1004 to gitlab1003 [puppet] - 10https://gerrit.wikimedia.org/r/1013339 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [17:01:48] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:01:55] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:04:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:05:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:06:02] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:06:32] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1013421 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:07:59] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:08:06] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:08:33] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:09:26] 06SRE, 
10SRE-swift-storage, 10Thumbor, 06Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9654752 (10Ladsgroup) So for "miss" (=swift/thumbor hits). The top hitter gets 750 in the whole month. Quickly it settles to ~130 a month. This results to any... [17:10:28] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:10:33] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:10:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:10:53] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cmooney@cumin1002" [17:12:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cmooney@cumin1002" [17:12:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:12:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sretest2003.codfw.wmnet [17:12:28] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9654809 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1002 for hosts: `sretest2003.codfw.wmnet` - sretes... 
[17:12:37] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:13:32] (03PS1) 10EoghanGaffney: Revert "[gitlab] Switch gitlab-replica from gitlab1004 to gitlab1003" [puppet] - 10https://gerrit.wikimedia.org/r/1013263 [17:14:41] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update IPs for lsw irb interfaces codfw row a b private vlans - cmooney@cumin1002" [17:14:47] (03CR) 10Dzahn: [C:03+1] Revert "[gitlab] Switch gitlab-replica from gitlab1004 to gitlab1003" [puppet] - 10https://gerrit.wikimedia.org/r/1013263 (owner: 10EoghanGaffney) [17:14:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudbackup2003.mgmt.codfw.wmnet with reboot policy FORCED [17:15:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update IPs for lsw irb interfaces codfw row a b private vlans - cmooney@cumin1002" [17:15:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:15:35] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs100[6-8] - https://phabricator.wikimedia.org/T353845#9654829 (10VRiley-WMF) [17:15:43] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1013048 (https://phabricator.wikimedia.org/T358570) (owner: 10Brouberol) [17:17:43] !log changing IPv6 anycast GW IP on codfw row A/B switches [17:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:16] (03CR) 10EoghanGaffney: [C:03+2] Revert "[gitlab] Switch gitlab-replica from gitlab1004 to gitlab1003" [puppet] - 10https://gerrit.wikimedia.org/r/1013263 (owner: 10EoghanGaffney) [17:19:36] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs100[6-8] - https://phabricator.wikimedia.org/T353845#9654844 (10VRiley-WMF) [17:20:59] (03CR) 10Dzahn: [C:03+2] "/etc/envoy/ssl# openssl x509 -noout -ext subjectAltName -in ./discovery__doc_discovery_wmnet_server.pem" [puppet] - 10https://gerrit.wikimedia.org/r/1013421 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:20:59] (03PS1) 10C. Scott Ananian: Exclude night-mode lint from signature validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013583 (https://phabricator.wikimedia.org/T360796) [17:23:13] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs100[6-8] - https://phabricator.wikimedia.org/T353845#9654852 (10VRiley-WMF) [17:23:33] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs100[6-8] - https://phabricator.wikimedia.org/T353845#9654868 (10VRiley-WMF) This has been completed [17:23:49] 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission wdqs100[6-8] - 14https://phabricator.wikimedia.org/T353845#9654869 (10VRiley-WMF) 05Open→03Resolved [17:24:05] 10ops-eqiad, 06SRE: PowerSupplyFailure - https://phabricator.wikimedia.org/T360722#9654872 (10VRiley-WMF) a:03VRiley-WMF [17:24:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:25:39] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.gitlab.failover (exit_code=99) Failover of gitlab from 
gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [17:26:35] (03PS2) 10Dzahn: delete doc.discovery dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1013419 (https://phabricator.wikimedia.org/T360413) [17:26:46] (03CR) 10Dzahn: [V:03+2 C:03+2] delete doc.discovery dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1013419 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:27:18] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:29:19] (03CR) 10Dzahn: [C:03+2] ssl: delete doc.discovery cert, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013420 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:29:25] (03PS2) 10Dzahn: ssl: delete doc.discovery cert, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013420 (https://phabricator.wikimedia.org/T360413) [17:31:06] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:31:13] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:31:31] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2004'] [17:31:43] (03PS1) 10EoghanGaffney: [gitlab] Move backup script locking out of main script root [puppet] - 10https://gerrit.wikimedia.org/r/1013585 (https://phabricator.wikimedia.org/T358559) [17:32:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudbackup2004'] [17:32:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudbackup2003.mgmt.codfw.wmnet with reboot policy FORCED [17:32:23] (03CR) 10Dzahn: [V:03+2 C:03+2] ssl: delete 
doc.discovery cert, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013420 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:32:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudbackup2003.mgmt.codfw.wmnet with reboot policy FORCED [17:34:00] (03CR) 10Dzahn: [C:03+1] [gitlab] Move backup script locking out of main script root [puppet] - 10https://gerrit.wikimedia.org/r/1013585 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [17:38:23] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9654909 (10Jhancock.wm) [17:41:40] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:41:47] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:42:16] Hey all ideally we would have caught this yesterday but what with the train delay we now have a quite serious bug impacting editors. Is anyone able to help me backport this configuration change? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1013583 [17:42:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudbackup2003.mgmt.codfw.wmnet with reboot policy FORCED [17:42:55] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2003'] [17:43:09] (03CR) 10Jdlrobson: [C:03+1] Exclude night-mode lint from signature validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013583 (https://phabricator.wikimedia.org/T360796) (owner: 10C. 
Scott Ananian) [17:43:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudbackup2003'] [17:43:21] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2003'] [17:44:42] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:44:49] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:45:36] jdlrobson: I'm around [17:46:10] thanks dancy [17:46:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:46:36] !log depool ms-fe2010 [17:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013583 (https://phabricator.wikimedia.org/T360796) (owner: 10C. Scott Ananian) [17:47:45] (03Merged) 10jenkins-bot: Exclude night-mode lint from signature validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013583 (https://phabricator.wikimedia.org/T360796) (owner: 10C. 
Scott Ananian) [17:48:01] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1013583|Exclude night-mode lint from signature validation (T360796)]] [17:48:05] T360796: Hidden lint rules are unexpectedly applying to signatures and throwing errors - https://phabricator.wikimedia.org/T360796 [17:49:11] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9654941 (10Dzahn) [17:49:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudbackup2003'] [17:49:28] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:49:35] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:50:20] !log dancy@deploy1002 dancy and cscott: Backport for [[gerrit:1013583|Exclude night-mode lint from signature validation (T360796)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:50:33] jdlrobson: Can you test? 
[17:51:13] 06SRE, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412#9654946 (10BTullis) [17:51:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 37.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:51:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS bookworm [17:51:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2004.codfw.wmnet with OS bookworm [17:51:50] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9654949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2003.codfw.... [17:51:55] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9654950 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2004.codfw.... 
[17:53:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:53:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:56:20] (03CR) 10Dzahn: [C:03+2] releases: switch SSL cert provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:56:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:56:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:59:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:00:04] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:00:10] (03CR) 10Dzahn: [C:03+2] "openssl x509 -noout -ext subjectAltName -in /etc/envoy/ssl/discovery__releases_discovery_wmnet_server.pem" [puppet] - 10https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:00:33] dancy: yep on it [18:00:37] sorry got side tracked [18:00:41] thx [18:02:26] dancy: yep that works [18:02:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:02:27] please sync [18:02:33] !log dancy@deploy1002 dancy and cscott: Continuing with sync [18:02:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:03:06] Rolling out. It'll be about 15 minutes to be active everywhere. 
[18:03:35] (03PS2) 10Dzahn: delete releases.discovery dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1013418 (https://phabricator.wikimedia.org/T360413) [18:03:59] (03CR) 10Dzahn: [V:03+2 C:03+2] delete releases.discovery dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1013418 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:04:34] (03PS2) 10Dzahn: ssl: delete releases.discovery cert, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013414 (https://phabricator.wikimedia.org/T360413) [18:04:52] thanks dancy [18:05:07] np [18:06:03] (03CR) 10Dzahn: [C:03+2] ssl: delete releases.discovery cert, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013414 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:06:39] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:06:55] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:10:08] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9654990 (10Dzahn) [18:10:31] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9655003 (10Dzahn) [18:11:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:11:10] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:11:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup2004.codfw.wmnet with reason: host reimage [18:13:56] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1013583|Exclude night-mode lint from signature validation (T360796)]] (duration: 25m 55s) [18:14:01] T360796: Hidden lint rules are unexpectedly applying to signatures and throwing 
errors - https://phabricator.wikimedia.org/T360796 [18:14:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup2004.codfw.wmnet with reason: host reimage [18:14:41] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:14:49] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:30:59] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:35:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:35:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup2004.codfw.wmnet with OS bookworm [18:36:08] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudbackup2004.codfw.wmne... [18:37:09] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655105 (10Jhancock.wm) [19:09:47] 06SRE, 10Data Pipelines, 06Data-Engineering, 06Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9655162 (10dr0ptp4kt) Okay, if I understand correctly, then the idea would be to... 1. Continue "allowing" tagging of wprov for non-200 HTTP responses. It'... 
[19:14:18] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655164 (10Jhancock.wm) cloudbackup2003 has os after a few attempts. had to delete and redo the virtual disks twice before it took. b... [19:17:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:20:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:41:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup2003.codfw.wmnet with OS bookworm [21:09:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:14:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:14:45] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:26:49] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:26:56] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:29:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:30:08] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:30:15] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:34:53] !log 
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:35:00] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:42:52] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:42:59] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:50:22] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9655495 (10Dzahn) [21:51:41] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9655492 (10Dzahn) 05Open→03In progress p:05Triage→03High a:03Dzahn [21:59:40] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:59:47] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:03:24] (03PS1) 10Urbanecm: Add CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013608 (https://phabricator.wikimedia.org/T357766) [22:03:25] (03PS1) 10Urbanecm: Add wmgUseCommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766) [22:03:27] (03PS1) 10Urbanecm: [beta] eswiki: Enable CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013610 (https://phabricator.wikimedia.org/T357766) [22:03:29] (03PS1) 10Urbanecm: [beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013611 (https://phabricator.wikimedia.org/T357766) [22:08:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:08:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:13:16] (03CR) 
10Dzahn: "Fair enough! Thank you, i'll do that and amend accordingly." [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [22:18:44] (03PS5) 10Dzahn: prometheus/apache_exporter: drop argument parameter [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) [22:44:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:44:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:54:25] !log Phabricator - added to group WMF-NDA for private tickets: @adee_wmde, @AbbanWMDE, @Andrew-WMDE per T358578 and "NDA and MOU" spreadsheet [22:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:29] T358578: Add WMDE staff who have signed the NDA with the WMF to the WMF-NDA phabricator policy group - https://phabricator.wikimedia.org/T358578 [23:05:28] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:05:35] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:08:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:08:45] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:10:52] !log Phabricator - added to group WMF-NDA for private tickets: @Dima_Koushha_WMDE, @elal, @danshick_wmde, @gabriel-wmde per T358578 and "NDA and MOU" spreadsheet [23:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:57] T358578: Add WMDE staff who have signed the NDA with the WMF to the WMF-NDA phabricator policy group - https://phabricator.wikimedia.org/T358578 [23:13:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host cloudbackup2003.mgmt.codfw.wmnet with reboot policy FORCED [23:15:15] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS bookworm [23:15:26] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655633 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2003.codfw.... [23:17:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:20:01] !log Phabricator - added to group WMF-NDA for private tickets: @Ifrahkhanyaree_WMDE , @jon_amar-WMDE , @lilients_WMDE , @RickiJay-WMDE per T358578 and "NDA and MOU" spreadsheet [23:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:10] T358578: Add WMDE staff who have signed the NDA with the WMF to the WMF-NDA phabricator policy group - https://phabricator.wikimedia.org/T358578 [23:20:41] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:30:19] !log Phabricator - added to group WMF-NDA for private tickets: @roti_WMDE , @Siko_WMDE , @Tobi_WMDE_SW , @thiemowmde , @WMDECyn , @WMDE-Fisch per T358578 and "NDA and MOU" spreadsheet [23:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:23] T358578: Add WMDE staff who have signed the NDA with the WMF to the WMF-NDA phabricator policy group - https://phabricator.wikimedia.org/T358578 [23:48:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:54:07] (03PS1) 10Dzahn: etherpad: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1013648 [23:55:02] (03PS1) 10Dzahn: stewards: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1013649 [23:58:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup2003.codfw.wmnet with OS bookworm [23:59:00] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudbackup2003.codfw.wmne...