[00:27:41] (SystemdUnitFailed) firing: wmf_auto_restart_nginx.service on apt2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:29:17] SRE, SRE-swift-storage, Thumbor, Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9648018 (Ladsgroup) Something to consider: {T360589}
[00:37:50] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1012662
[00:37:53] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1012662 (owner: TrainBranchBot)
[00:44:35] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:52:39] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:52:45] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:01:05] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1012662 (owner: TrainBranchBot)
[01:04:35] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:07:58] ops-esams, SRE, DC-Ops, Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9648076 (ssingh) Hi Rob: Checking if the date/time above has been confirmed by remote hands?
[01:20:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:20:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:24:22] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:24:29] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:46:30] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:46:37] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:50:17] (PS1) Pppery: Update links to point to non-wiki privacy policy and bypass redirects [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129)
[02:00:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:00:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:06:33] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:06:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:12:17] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:12:24] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:16:37] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:16:44] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:19:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:20:04] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:24:04] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:24:11] !log @deploy2002 helmfile
[eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:27:10] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:27:17] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:28:40] (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:37:17] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:42:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:10:58] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:11:05] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:17:17] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:33:28] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:33:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:38:08] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:38:15] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:03:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 830.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:08:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 837.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:27:41] (SystemdUnitFailed) firing: wmf_auto_restart_nginx.service on apt2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:49:36] (Wikidata Reliability Metrics - Median loading time alert) firing: -
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:09:36] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:26:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:26:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:27:42] * kart_ will deploy cxserver..
[05:27:49] (CR) KartikMistry: [C:+2] Update cxserver to 2024-03-20-072017-production [deployment-charts] - https://gerrit.wikimedia.org/r/1013047 (https://phabricator.wikimedia.org/T352739) (owner: KartikMistry)
[05:28:44] (Merged) jenkins-bot: Update cxserver to 2024-03-20-072017-production [deployment-charts] - https://gerrit.wikimedia.org/r/1013047 (https://phabricator.wikimedia.org/T352739) (owner: KartikMistry)
[05:31:03] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:31:31] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:32:03] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:32:39] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:33:18] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:33:57] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:36:26] !log Updated cxserver to 2024-03-20-072017-production (T352739)
[05:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:31] T352739: cxserver: Cannot read properties of undefined (reading 'pages') - https://phabricator.wikimedia.org/T352739
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T0600)
[06:00:04] kormat, marostegui, Amir1, and
arnaudb: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T0600)
[06:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:20:24] (PS1) Marostegui: es2023: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1013162 (https://phabricator.wikimedia.org/T358746)
[06:21:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on es[2023-2025].codfw.wmnet with reason: Migrate to 10.6
[06:21:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2023-2025].codfw.wmnet with reason: Migrate to 10.6
[06:22:01] (CR) Marostegui: [C:+2] es2023: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1013162 (https://phabricator.wikimedia.org/T358746) (owner: Marostegui)
[06:24:03] (PS1) Marostegui: installserver: Do not reimage es2035 [puppet] - https://gerrit.wikimedia.org/r/1013163
[06:25:12] !log dbmaint deploy schema change s2 codfw T356166
[06:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:16] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:25:33] (PS2) Tim Starling: SwiftTooManyMediaUploads: reduce severity [alerts] - https://gerrit.wikimedia.org/r/1010347
[06:25:38] !log dbmaint deploy schema change s1 codfw T356166
[06:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on 17 hosts with reason: Schema change T356166
[06:26:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 17 hosts with reason: Schema change
T356166
[06:27:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on 12 hosts with reason: Schema change T356166
[06:27:59] !log dbmaint deploy schema change s3 codfw T356166
[06:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 12 hosts with reason: Schema change T356166
[06:28:40] (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:29:17] !log dbmaint deploy schema change s1 codfw T355609
[06:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:22] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:29:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 17 hosts with reason: Schema change T356166
[06:30:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 17 hosts with reason: Schema change T356166
[06:30:22] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:42:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:43:30] (CR) Marostegui: [C:+2] installserver: Do not reimage es2035
[puppet] - https://gerrit.wikimedia.org/r/1013163 (owner: Marostegui)
[06:51:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[06:52:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[06:52:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:52:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:52:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T356166)', diff saved to https://phabricator.wikimedia.org/P58845 and previous config saved to /var/cache/conftool/dbconfig/20240321-065232-marostegui.json
[06:52:36] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:54:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T356166)', diff saved to https://phabricator.wikimedia.org/P58846 and previous config saved to /var/cache/conftool/dbconfig/20240321-065446-marostegui.json
[07:01:46] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:01:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter -
https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P58847 and previous config saved to /var/cache/conftool/dbconfig/20240321-070954-marostegui.json
[07:12:01] (PS1) Giuseppe Lavagetto: wikifeeds: scale up resources [deployment-charts] - https://gerrit.wikimedia.org/r/1013165
[07:19:31] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:19:37] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:19:40] (CR) Giuseppe Lavagetto: [C:+2] wikifeeds: scale up resources [deployment-charts] - https://gerrit.wikimedia.org/r/1013165 (owner: Giuseppe Lavagetto)
[07:20:49] (Merged) jenkins-bot: wikifeeds: scale up resources [deployment-charts] - https://gerrit.wikimedia.org/r/1013165 (owner: Giuseppe Lavagetto)
[07:21:45] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:22:48] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[07:23:06] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[07:24:49] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:24:56] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:25:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance
db1156', diff saved to https://phabricator.wikimedia.org/P58848 and previous config saved to /var/cache/conftool/dbconfig/20240321-072501-marostegui.json
[07:28:10] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:28:18] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:33:04] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:33:11] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:37:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:37:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:40:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T356166)', diff saved to https://phabricator.wikimedia.org/P58849 and previous config saved to /var/cache/conftool/dbconfig/20240321-074009-marostegui.json
[07:40:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[07:40:14] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[07:40:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[07:40:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T356166)', diff saved to https://phabricator.wikimedia.org/P58850 and previous config saved to /var/cache/conftool/dbconfig/20240321-074032-marostegui.json
[07:43:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:43:49] (CR) Muehlenhoff:
[C:+2] sre.puppet.renew-cert: Extend help text for --installer [cookbooks] - https://gerrit.wikimedia.org/r/1013012 (owner: Muehlenhoff)
[07:50:11] (PS1) Slyngshede: site: Add new IDP production hosts. [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748)
[07:54:01] (PS1) Anzx: dewiki: Enable mobile page tabs for everyone [mediawiki-config] - https://gerrit.wikimedia.org/r/1012115 (https://phabricator.wikimedia.org/T360246)
[07:55:01] (PS5) Anzx: knwikisource, knwiktionary: update logo, wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1010881 (https://phabricator.wikimedia.org/T360022)
[08:00:04] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T0800)
[08:00:05] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:07] o/
[08:04:24] (CR) Muehlenhoff: site: Add new IDP production hosts. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[08:05:51] (PS2) Slyngshede: site: Add new IDP production hosts. [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748)
[08:05:59] (CR) Slyngshede: site: Add new IDP production hosts.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[08:27:41] (SystemdUnitFailed) firing: wmf_auto_restart_nginx.service on apt2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:37] (CR) Fabfur: [C:+2] haproxy: add parameter for optional log length [puppet] - https://gerrit.wikimedia.org/r/1013114 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[08:35:57] (CR) JMeybohm: profile::prometheus::k8s: move istio metrics to a separate job (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: Elukey)
[08:40:37] !log repooling cp4037 for about ~30m (T358109)
[08:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:43] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109
[08:40:44] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[08:46:10] (CR) Filippo Giunchedi: [C:+1] SwiftTooManyMediaUploads: reduce severity [alerts] - https://gerrit.wikimedia.org/r/1010347 (owner: Tim Starling)
[08:47:45] (PS2) Jelto: gitlab: temporary allow dockerfile frontend on Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/1013049 (https://phabricator.wikimedia.org/T357612)
[08:49:50] (CR) Jelto: gitlab: temporary allow dockerfile frontend on Trusted Runners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013049 (https://phabricator.wikimedia.org/T357612) (owner: Jelto)
[08:54:36] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:55:04] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] -
https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[08:57:09] (CR) Jcrespo: [C:+2] mediabackup::storage: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[08:57:16] (PS3) Jcrespo: mediabackup::storage: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[08:57:31] (CR) Slyngshede: [C:+2] site: Add new IDP production hosts. [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[08:58:16] SRE, ChangeProp, GitLab, MediaWiki-File-management, Platform Team Initiatives (API Gateway): Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596 (akosiaris) NEW
[08:58:32] (PS1) Fabfur: benthos: using URIPATH and URIPARAM for parsing corresponding fields [puppet] - https://gerrit.wikimedia.org/r/1013225 (https://phabricator.wikimedia.org/T358109)
[08:59:51] (CR) Jcrespo: [V:+2 C:+2] mediabackup::storage: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[08:59:55] SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 3 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648360 (akosiaris)
[09:00:57] SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 4 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648367 (Peachey88)
[09:02:18] SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 4 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648371 (MoritzMuehlenhoff)
[09:10:08] !log fabfur@cumin1002 conftool action : set/pooled=no; selector:
name=cp4037.ulsfo.wmnet
[09:10:39] (CR) Fabfur: [C:+2] benthos: using URIPATH and URIPARAM for parsing corresponding fields [puppet] - https://gerrit.wikimedia.org/r/1013225 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[09:12:13] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp2003.wikimedia.org
[09:12:14] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[09:14:23] (PS1) Muehlenhoff: Delete peopleweb dummy cert [labs/private] - https://gerrit.wikimedia.org/r/1013227 (https://phabricator.wikimedia.org/T360413)
[09:14:36] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:15:12] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1013145 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[09:16:28] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2003.wikimedia.org - slyngshede@cumin1002"
[09:17:19] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2003.wikimedia.org - slyngshede@cumin1002"
[09:17:19] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:17:20] !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp2003.wikimedia.org on all recursors
[09:17:23] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp2003.wikimedia.org on all recursors
[09:17:49] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2003.wikimedia.org - slyngshede@cumin1002"
[09:18:41] !log slyngshede@cumin1002 END (PASS) - Cookbook
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2003.wikimedia.org - slyngshede@cumin1002"
[09:19:13] (PS1) Fabfur: benthos: uri_query should be optional [puppet] - https://gerrit.wikimedia.org/r/1013228 (https://phabricator.wikimedia.org/T358109)
[09:22:20] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp2003.wikimedia.org with OS bookworm
[09:24:51] (CR) Fabfur: [C:+2] benthos: uri_query should be optional [puppet] - https://gerrit.wikimedia.org/r/1013228 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[09:25:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T356166)', diff saved to https://phabricator.wikimedia.org/P58851 and previous config saved to /var/cache/conftool/dbconfig/20240321-092533-marostegui.json
[09:25:38] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[09:28:12] (CR) Muehlenhoff: planet: switch envoy SSL provider to cfssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[09:37:56] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1013146 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[09:38:01] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp2003.wikimedia.org with reason: host reimage
[09:39:53] (CR) Muehlenhoff: releases: switch SSL cert provider to cfssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[09:40:29] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp2003.wikimedia.org with reason: host reimage
[09:40:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved
to https://phabricator.wikimedia.org/P58852 and previous config saved to /var/cache/conftool/dbconfig/20240321-094041-marostegui.json
[09:42:13] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1013090 (https://phabricator.wikimedia.org/T349206) (owner: Majavah)
[09:46:22] !log akosiaris@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw
[09:55:18] (CR) Muehlenhoff: AQS1.0: disable aqs service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[09:55:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P58853 and previous config saved to /var/cache/conftool/dbconfig/20240321-095548-marostegui.json
[09:59:13] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp2003.wikimedia.org with OS bookworm
[09:59:13] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp2003.wikimedia.org
[09:59:48] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp1003.wikimedia.org
[09:59:50] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[10:00:13] !log repooling cp4037 for about ~30m (this is last time I'll notice here, no need for this in the future) (T358109)
[10:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:24] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[10:00:24] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109
[10:01:31] !log update ceph-reef packages to 18.2.2 on apt.wm.org
[10:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:12] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera
data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1003.wikimedia.org - slyngshede@cumin1002"
[10:03:40] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1003.wikimedia.org - slyngshede@cumin1002"
[10:03:40] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:03:41] !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp1003.wikimedia.org on all recursors
[10:03:43] (PS1) Fabfur: benthos: fix optional space in grok pattern (when no uri_query present) [puppet] - https://gerrit.wikimedia.org/r/1013232 (https://phabricator.wikimedia.org/T358109)
[10:03:44] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp1003.wikimedia.org on all recursors
[10:04:08] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1003.wikimedia.org - slyngshede@cumin1002"
[10:04:47] (PS5) Brouberol: AQS1.0: disable aqs service [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522)
[10:05:00] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1003.wikimedia.org - slyngshede@cumin1002"
[10:05:36] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp1003.wikimedia.org with OS bookworm
[10:06:14] (CR) Brouberol: AQS1.0: disable aqs service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[10:09:16] (CR) Muehlenhoff: [C:+1] "Looks good."
[puppet] - 10https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: 10Brouberol) [10:10:14] (03PS3) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) [10:10:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T356166)', diff saved to https://phabricator.wikimedia.org/P58854 and previous config saved to /var/cache/conftool/dbconfig/20240321-101056-marostegui.json [10:10:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [10:11:01] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [10:11:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [10:11:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T356166)', diff saved to https://phabricator.wikimedia.org/P58855 and previous config saved to /var/cache/conftool/dbconfig/20240321-101119-marostegui.json [10:11:40] (03PS2) 10Fabfur: benthos: fix optional space in grok pattern (when no uri_query present) [puppet] - 10https://gerrit.wikimedia.org/r/1013232 (https://phabricator.wikimedia.org/T358109) [10:13:22] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw [10:15:59] (03PS24) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [10:16:38] (03CR) 10Fabfur: [C:03+2] benthos: fix optional space in grok pattern (when no uri_query present) [puppet] - 10https://gerrit.wikimedia.org/r/1013232 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:17:30] 
!log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp1003.wikimedia.org with reason: host reimage [10:20:07] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp1003.wikimedia.org with reason: host reimage [10:28:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1153.eqiad.wmnet with OS bookworm [10:29:51] !log akosiaris@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [10:30:07] (03PS1) 10Phuedx: Update mediawiki.web_ui_actions stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013234 (https://phabricator.wikimedia.org/T353029) [10:31:40] (03PS2) 10Muehlenhoff: prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013074 [10:31:45] (03CR) 10Brouberol: [C:03+2] AQS1.0: disable aqs service [puppet] - 10https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: 10Brouberol) [10:31:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013074 (owner: 10Muehlenhoff) [10:32:21] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [10:34:05] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp1003.wikimedia.org with OS bookworm [10:34:05] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp1003.wikimedia.org [10:38:40] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::k8s: haproxy: Do not start keepalived too early [puppet] - 10https://gerrit.wikimedia.org/r/1013090 (https://phabricator.wikimedia.org/T349206) (owner: 10Majavah) [10:39:26] (03CR) 10Filippo Giunchedi: [C:03+2] installserver: update centrallog partman [puppet] - 10https://gerrit.wikimedia.org/r/1013068 (https://phabricator.wikimedia.org/T359451) (owner: 10Filippo Giunchedi) [10:41:09] !log 
marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1153.eqiad.wmnet with reason: host reimage [10:42:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:43:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1153.eqiad.wmnet with reason: host reimage [10:50:01] (03PS1) 10Brouberol: superset-next: upgrade to 3.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013236 (https://phabricator.wikimedia.org/T358674) [10:50:44] (03PS1) 10Slyngshede: R:idp enable new Bookworm hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748) [10:50:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T356166)', diff saved to https://phabricator.wikimedia.org/P58856 and previous config saved to /var/cache/conftool/dbconfig/20240321-105052-marostegui.json [10:50:57] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [10:51:07] (03PS1) 10Muehlenhoff: Also disable monitoring for AQS1 [puppet] - 10https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) [10:51:23] (03PS1) 10Brouberol: superset: upgrade to 3.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013239 (https://phabricator.wikimedia.org/T358674) [10:52:08] (03CR) 10Alexandros Kosiaris: [C:04-1] profile::prometheus::k8s: move istio metrics to a separate job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [10:52:15] (03CR) 10Slyngshede: "Best way to roll out this patch is probably to disable Puppet on the existing hosts and let the two new hosts come up and verify that 
they" [puppet] - 10https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:53:06] 06SRE, 10ChangeProp, 06Commons, 10GitLab, and 7 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648786 (10taavi) [10:53:23] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9648799 (10cmooney) >>! In T358244#9636601, @ayounsi wrote: > FYI it's alerting for one of its PSU being down, but we don't really care anymore : >> asw-a-codfw> show syste... [10:53:39] (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Fix progress_bars parameter (should be print_progress_bars) [cookbooks] - 10https://gerrit.wikimedia.org/r/1010559 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [10:53:47] (03CR) 10Muehlenhoff: "There's two separate patches we need to prepare first before this can go live:" [puppet] - 10https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:55:17] (03PS1) 10Majavah: Adapt clean-stale-puppet-certs for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013240 [10:55:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [10:56:41] (03CR) 10CI reject: [V:04-1] Adapt clean-stale-puppet-certs for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013240 (owner: 10Majavah) [10:58:05] (03Merged) 10jenkins-bot: [gitlab] Fix progress_bars parameter (should be print_progress_bars) [cookbooks] - 10https://gerrit.wikimedia.org/r/1010559 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [10:58:16] (03PS1) 10Slyngshede: P:acme_chief::certificates Add new IDP hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1013241 (https://phabricator.wikimedia.org/T357748) [10:58:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1153.eqiad.wmnet with OS bookworm [10:59:03] (03PS2) 10Majavah: Adapt clean-stale-puppet-certs for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013240 [10:59:58] (03CR) 10Brouberol: Also disable monitoring for AQS1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff) [11:00:05] mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1100) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1100) [11:00:23] (03PS1) 10Marostegui: db1153: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1013242 (https://phabricator.wikimedia.org/T353499) [11:04:43] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9648841 (10dcaro) [11:06:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P58857 and previous config saved to /var/cache/conftool/dbconfig/20240321-110600-marostegui.json [11:06:21] (03CR) 10Brouberol: Also disable monitoring for AQS1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff) [11:06:37] (03PS1) 10Slyngshede: P:mariadb::ferm_misc Add new IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/1013245 [11:07:30] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - 
https://phabricator.wikimedia.org/T348643#9648853 (10dcaro) New hard drives offline uncorrectable values (cloudcephosd1030) are all 0: ` root@cloudcephosd1030... [11:08:38] (03PS2) 10Slyngshede: R:idp enable new Bookworm hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748) [11:11:18] (03CR) 10Marostegui: [C:03+2] db1153: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1013242 (https://phabricator.wikimedia.org/T353499) (owner: 10Marostegui) [11:21:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P58860 and previous config saved to /var/cache/conftool/dbconfig/20240321-112108-marostegui.json [11:23:24] jouncebot now [11:23:24] For the next 0 hour(s) and 36 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1100) [11:23:24] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1100) [11:25:06] mvolz: do you have anything to deploy in the next deployment window? [11:25:57] effie: no, use it if you need to [11:26:04] excellent! [11:26:09] thank you! 
[11:27:46] Dear Deployers, we will be switching over the deployment server, please refrain from using it until further notice [11:36:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T356166)', diff saved to https://phabricator.wikimedia.org/P58862 and previous config saved to /var/cache/conftool/dbconfig/20240321-113615-marostegui.json [11:36:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [11:36:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [11:36:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T356166)', diff saved to https://phabricator.wikimedia.org/P58863 and previous config saved to /var/cache/conftool/dbconfig/20240321-113638-marostegui.json [11:46:00] !log akosiaris@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1200) [12:00:52] !log disable puppet on deployment servers [12:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:15:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad [12:16:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T356166)', diff saved to https://phabricator.wikimedia.org/P58865 and previous config saved to /var/cache/conftool/dbconfig/20240321-121628-marostegui.json [12:16:32] T356166: Drop 
cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [12:18:25] (SystemdUnitFailed) firing: imagecatalog_record.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [12:22:26] (SystemdUnitFailed) firing: (2) wmf_auto_restart_nginx.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:28:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:31:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P58866 and previous config saved to /var/cache/conftool/dbconfig/20240321-123135-marostegui.json [12:34:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2005.codfw.wmnet with reason: host reimage [12:37:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2005.codfw.wmnet with reason: host reimage [12:39:05] !log jiji@deploy1002 Started scap: Check new deployment server (deploy1002) post switchover - March 2024 [12:46:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P58867 and previous config saved to /var/cache/conftool/dbconfig/20240321-124644-marostegui.json [12:54:39] (Wikidata Reliability Metrics - Median loading time alert) 
firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1300). nyaa~ [13:00:05] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] I can’t deploy, sorry [13:01:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T356166)', diff saved to https://phabricator.wikimedia.org/P58868 and previous config saved to /var/cache/conftool/dbconfig/20240321-130151-marostegui.json [13:01:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:01:55] effie, I don't think you're done with the deploy server switchover, are you? 
[13:02:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:02:09] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [13:02:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T356166)', diff saved to https://phabricator.wikimedia.org/P58869 and previous config saved to /var/cache/conftool/dbconfig/20240321-130213-marostegui.json [13:05:12] claime: I am still syncing worl [13:05:13] d [13:05:18] ack [13:06:14] (03PS7) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [13:06:25] (03CR) 10CI reject: [V:04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [13:07:09] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9648883 (10dcaro) [13:07:33] (03CR) 10Muehlenhoff: Also disable monitoring for AQS1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff) [13:07:41] (03PS1) 10Ilias Sarantopoulos: ml-services: update articledesc and llm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013269 (https://phabricator.wikimedia.org/T360212) [13:07:49] (03PS2) 10Muehlenhoff: Also disable monitoring for AQS1 [puppet] - 10https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) [13:08:38] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1013245 (owner: 10Slyngshede) [13:08:57] (ProbeDown) firing: (2) Service aqs:7232 has failed probes (http_aqs_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#aqs:7232 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1013241 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [13:09:26] (03PS1) 10Muehlenhoff: Remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/1013270 [13:09:30] moritzm: is that you? [13:09:42] (03CR) 10EoghanGaffney: gitlab: fix irc log for backup complete message (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 (owner: 10Jelto) [13:09:42] What s this? [13:09:58] (03CR) 10Alexandros Kosiaris: "If I understand correctly we won't be seeing metrics for upstream/downstream that don't see any rps. It's probably ok as a stopgap optimiz" [puppet] - 10https://gerrit.wikimedia.org/r/1012995 (https://phabricator.wikimedia.org/T359633) (owner: 10Filippo Giunchedi) [13:10:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:10:24] !incidents [13:10:25] 4531 (UNACKED) [2x] ProbeDown sre (ip4 aqs:7232 probes/service http_aqs_ip4) [13:10:25] 4530 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [13:10:25] 4529 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [13:10:25] 4528 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [13:10:30] (03CR) 10EoghanGaffney: [C:03+1] apt-staging: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/1012346 
(https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [13:10:38] !ack 4531 [13:10:38] 4531 (ACKED) [2x] ProbeDown sre (ip4 aqs:7232 probes/service http_aqs_ip4) [13:10:46] (03CR) 10Alexandros Kosiaris: "I forgot to ask, is this a stopgap? Do we intend to revert it once we 've sorted out other prometheus infrastructure related issues?" [puppet] - 10https://gerrit.wikimedia.org/r/1012995 (https://phabricator.wikimedia.org/T359633) (owner: 10Filippo Giunchedi) [13:11:13] sukhe: heyas i was out of it yesterday booster knocked me on my ass so I didn't write up the detailed directions for remote hands esams [13:11:14] akosiaris: I think its from https://phabricator.wikimedia.org/T360522 [13:11:30] bah wrong channel meant to state in dc ops lol [13:11:35] (03CR) 10Klausman: [C:03+1] ml-services: update articledesc and llm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013269 (https://phabricator.wikimedia.org/T360212) (owner: 10Ilias Sarantopoulos) [13:11:49] and the monitoring disable patch is just a bit late [13:12:26] cc brouberol / moritzm [13:12:47] (03PS8) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [13:13:09] wow, that Gerrit/wikibugs message sure had a long delay. I +1'd at 13:01 according to the webui, and the bot only mentioned it at 14:11? [13:13:17] Ok, so it should clear on its own? [13:13:19] (03CR) 10CI reject: [V:04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [13:13:42] akosiaris: just a wild guess because of the coincidence [13:13:44] sorry about the false-alarm. We had a monitor CR merged to disable the alarm altogether :/ [13:14:20] we disabled AQS probe monitoring and then disabled the AQS service, so that's no coincidence. 
However, we didn't anticipate the alert still firing [13:14:26] !log jiji@deploy1002 Finished scap: Check new deployment server (deploy1002) post switchover - March 2024 (duration: 35m 20s) [13:14:39] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:14:40] (03PS5) 10MVernon: Add new ceph container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) [13:14:48] (03CR) 10MVernon: "Thanks for those two spots; I've corrected both (and updated version to match the newer upstream packages I've pulled to our apt repo), an" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:15:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.89% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:15:44] (03PS1) 10Effie Mouzeli: deployment: update deployment DNS record to deploy1002 (switchover #6) [dns] - 10https://gerrit.wikimedia.org/r/1013272 (https://phabricator.wikimedia.org/T357547) [13:15:52] (03PS1) 10KartikMistry: Update cxserver to 2024-03-21-114859-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013273 (https://phabricator.wikimedia.org/T353510) [13:16:13] brouberol: ack - it probably just missing a puppet run then to take effect [13:16:22] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[13:16:32] 06SRE, 10ChangeProp, 06Commons, 10GitLab, and 7 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648979 (10Reedy) https://github.com/Snapchat/KeyDB already existed as a fork. https://github.com/Snapchat/KeyDB/issues/798 was filed ex... [13:17:00] (03CR) 10Slyngshede: [C:03+2] P:mariadb::ferm_misc Add new IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/1013245 (owner: 10Slyngshede) [13:17:06] Dear Deployers, deployment server is switched to deploy1002, you can proceed [13:17:08] (03CR) 10Brouberol: [C:03+1] Also disable monitoring for AQS1 [puppet] - 10https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff) [13:17:16] (03CR) 10Slyngshede: [C:03+2] P:acme_chief::certificates Add new IDP hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1013241 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [13:17:48] (03PS1) 10Effie Mouzeli: hieradata: update deployment_server to deploy1002 (switchover #7) [puppet] - 10https://gerrit.wikimedia.org/r/1013274 (https://phabricator.wikimedia.org/T357547) [13:17:56] (03Abandoned) 10Clément Goubert: Revert "Add File:Claus_-_Conkle to blacklist" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012771 (owner: 10Clément Goubert) [13:18:20] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update articledesc and llm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013269 (https://phabricator.wikimedia.org/T360212) (owner: 10Ilias Sarantopoulos) [13:18:36] (03Merged) 10jenkins-bot: ml-services: update articledesc and llm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013269 (https://phabricator.wikimedia.org/T360212) (owner: 10Ilias Sarantopoulos) [13:19:09] (03PS1) 10Fabfur: benthos: allow truncated http protocol version [puppet] - 10https://gerrit.wikimedia.org/r/1013275 (https://phabricator.wikimedia.org/T358109) [13:19:25] (03CR) 10Alexandros 
Kosiaris: [C:03+1] hieradata: update deployment_server to deploy1002 (switchover #7) [puppet] - 10https://gerrit.wikimedia.org/r/1013274 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:19:41] (03CR) 10Alexandros Kosiaris: [C:03+1] deployment: update deployment DNS record to deploy1002 (switchover #6) [dns] - 10https://gerrit.wikimedia.org/r/1013272 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:19:53] (03CR) 10Effie Mouzeli: [C:03+2] deployment: update deployment DNS record to deploy1002 (switchover #6) [dns] - 10https://gerrit.wikimedia.org/r/1013272 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:20:00] wikibugs is prolly doing something it shouldnt [13:20:09] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: update deployment_server to deploy1002 (switchover #7) [puppet] - 10https://gerrit.wikimedia.org/r/1013274 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:20:41] (03PS1) 10Fabfur: benthos: added $schema key to unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1013278 (https://phabricator.wikimedia.org/T360450) [13:22:06] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9649036 (10Clement_Goubert) [13:23:26] (03PS1) 10Cparle: Sunsetting MachineVision extension, so remove config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013284 (https://phabricator.wikimedia.org/T352884) [13:23:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9649062 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [13:23:46] (03CR) 10Stevemunene: [C:03+1] superset-next: upgrade to 3.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013236 (https://phabricator.wikimedia.org/T358674) (owner: 10Brouberol) [13:25:59] 
(03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012664 [13:27:03] (03PS1) 10Ammarpad: Set wgUploadNavigationUrl for is.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013295 (https://phabricator.wikimedia.org/T360431) [13:27:39] 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, 10WMF-JobQueue: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625 (10Clement_Goubert) 03NEW [13:27:51] 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, 10WMF-JobQueue: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649117 (10Clement_Goubert) p:05Triage→03High [13:30:40] (03CR) 10Muehlenhoff: [C:03+2] Also disable monitoring for AQS1 [puppet] - 10https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff) [13:31:29] (03CR) 10Muehlenhoff: Inform users that their email address needs to be unique. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede) [13:31:45] 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, 10WMF-JobQueue: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649168 (10Clement_Goubert) [13:33:57] (03PS2) 10Slyngshede: Inform users that their email address needs to be unique. [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 [13:34:05] (03CR) 10Slyngshede: Inform users that their email address needs to be unique. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede) [13:34:21] (03CR) 10Muehlenhoff: [C:03+1] Inform users that their email address needs to be unique. [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede) [13:34:37] (03CR) 10Slyngshede: [C:03+2] Inform users that their email address needs to be unique. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede) [13:34:53] (03Merged) 10jenkins-bot: Inform users that their email address needs to be unique. [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede) [13:36:38] (03PS1) 10Clément Goubert: envoy: Add mw-jobrunner and videoscaler listeners [puppet] - 10https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625) [13:38:34] (03PS1) 10Slyngshede: Bitu version 0.0.6 [software/bitu] - 10https://gerrit.wikimedia.org/r/1013304 [13:38:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1013304 (owner: 10Slyngshede) [13:39:07] (03CR) 10Slyngshede: [C:03+2] Bitu version 0.0.6 [software/bitu] - 10https://gerrit.wikimedia.org/r/1013304 (owner: 10Slyngshede) [13:39:39] (03Merged) 10jenkins-bot: Bitu version 0.0.6 [software/bitu] - 10https://gerrit.wikimedia.org/r/1013304 (owner: 10Slyngshede) [13:41:43] (03PS9) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [13:42:23] (03CR) 10CI reject: [V:04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [13:42:49] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9649261 (10RobH) > We would like remote hands to fetch shipmnet DEL0158639 which contains (8) 6.5TB NVMe PCIe SSDs from Dell NL to Wikimedia. > > Proposted Work Window: 2023-03-27 @ 1100 CET >... 
[13:45:48] (03PS2) 10Jelto: gitlab: fix irc log for backup complete message [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 [13:48:17] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9649336 (10ssingh) @RobH: Verified the hosts, serial numbers, racking and the cadence. Looks good! [13:48:29] (03PS1) 10Majavah: haproxy: cloud: use package{} to install haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1013308 (https://phabricator.wikimedia.org/T360630) [13:48:37] (03PS1) 10Majavah: P:metricsinfra: haproxy: do not set httplog on backends [puppet] - 10https://gerrit.wikimedia.org/r/1013309 [13:48:45] (03PS1) 10Majavah: P:wmcs::metricsinfra: haproxy: use http-request replace-path [puppet] - 10https://gerrit.wikimedia.org/r/1013310 (https://phabricator.wikimedia.org/T360630) [13:49:26] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1676/console" [puppet] - 10https://gerrit.wikimedia.org/r/1013308 (https://phabricator.wikimedia.org/T360630) (owner: 10Majavah) [13:49:34] (03CR) 10Jelto: gitlab: fix irc log for backup complete message (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 (owner: 10Jelto) [13:50:03] !log upgrading pdns-rec to 4.8.7-1 on dns* and doh* hosts [13:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:00] (03PS1) 10EoghanGaffney: [gitlab] Lock backups on the destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 [13:52:35] Are these notifications delayed? I put up PS1 for this about 20 minutes ago [13:52:39] (03CR) 10EoghanGaffney: [C:03+1] gitlab: fix irc log for backup complete message [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 (owner: 10Jelto) [13:52:55] (03CR) 10Majavah: Inform users that their email address needs to be unique. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede) [13:53:35] (03CR) 10JMeybohm: external-services: define a chart referencing external services clusters (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:53:43] (03PS2) 10Tchanders: Schedule weekly purge of global_block_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) [13:53:59] 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, and 2 others: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649410 (10Joe) There is a few reasons why we didn't migrate changeprop to use the service mesh, first of all the fact we don't want to define timeouts ou... [13:54:11] (03CR) 10JMeybohm: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol) [13:54:21] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9649412 (10RobH) CS1553796 created. Will update one they confirm the window. 
[13:54:41] (03CR) 10CI reject: [V:04-1] [gitlab] Lock backups on the destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 (owner: 10EoghanGaffney) [13:54:49] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9649413 (10RobH) [13:55:09] (03PS1) 10David Caro: puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312 [13:55:33] (03PS2) 10David Caro: puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312 [13:55:42] (03CR) 10David Caro: puppetserver.cloud_vps: add role without stale certs check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013312 (owner: 10David Caro) [13:56:10] (03PS3) 10David Caro: puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312 [13:56:50] (03CR) 10Majavah: [C:03+1] puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312 (owner: 10David Caro) [13:57:18] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9649435 (10Jhancock.wm) [13:57:44] (03CR) 10JMeybohm: external-services: define a chart referencing external services clusters (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:57:52] (03CR) 10David Caro: [C:03+2] puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312 (owner: 10David Caro) [13:58:24] (03PS2) 10EoghanGaffney: [gitlab] Lock backups on the destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 [14:00:29] (03CR) 10JMeybohm: "Not a beauty but practical 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 
(https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [14:01:11] (03PS4) 10Elukey: profile::prometheus::k8s: move istio metrics to a separate job [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) [14:01:19] (03CR) 10JMeybohm: "*remove the need for the variable assignments" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [14:01:27] (03CR) 10Elukey: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [14:02:02] !log installing squid security updates [14:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:42] (03CR) 10JMeybohm: Add template rendering external services egress NetworkPolicy resources (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:04:30] 06SRE: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636 (10MoritzMuehlenhoff) 03NEW [14:04:38] (03PS1) 10Klausman: ml-services: Free up unused nllb200 pods in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013317 [14:04:46] (03CR) 10Brouberol: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol) [14:05:22] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[14:05:29] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: Free up unused nllb200 pods in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013317 (owner: 10Klausman) [14:05:53] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9649524 (10MoritzMuehlenhoff) [14:06:01] (03CR) 10Hashar: [C:03+1] "That would do it, at least using the example given on T358940. `1006969` is linked while `#1006969` is not. That is a nice hack. I think G" [puppet] - 10https://gerrit.wikimedia.org/r/1013097 (https://phabricator.wikimedia.org/T358940) (owner: 10Aklapper) [14:06:33] (03CR) 10Jelto: [C:03+2] gitlab: fix irc log for backup complete message [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 (owner: 10Jelto) [14:07:07] (03PS1) 10Ammarpad: throttle: Add throttle rule for editathon at Illinois Tech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) [14:07:15] (03CR) 10CI reject: [V:04-1] throttle: Add throttle rule for editathon at Illinois Tech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) (owner: 10Ammarpad) [14:07:40] (03CR) 10Klausman: [C:03+2] ml-services: Free up unused nllb200 pods in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013317 (owner: 10Klausman) [14:08:20] (03CR) 10Filippo Giunchedi: [C:03+1] Remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff) [14:09:12] (03Merged) 10jenkins-bot: ml-services: Free up unused nllb200 pods in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013317 (owner: 10Klausman) [14:09:20] (03PS2) 10Ammarpad: throttle: Add throttle rule for editathon at Illinois Tech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) [14:09:36] (03Merged) 
10jenkins-bot: gitlab: fix irc log for backup complete message [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 (owner: 10Jelto) [14:10:40] (03CR) 10Jelto: [C:03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1013097 (https://phabricator.wikimedia.org/T358940) (owner: 10Aklapper) [14:12:23] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/WMF for zoe - https://phabricator.wikimedia.org/T360639 (10zoe) 03NEW [14:13:03] (03PS7) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [14:13:52] (03CR) 10CI reject: [V:04-1] global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [14:14:32] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/WMF for zoe - https://phabricator.wikimedia.org/T360639#9649648 (10zoe) [14:16:02] (03PS1) 10Klausman: ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 [14:16:10] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:16:34] (03PS8) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [14:17:15] (03CR) 10Ilias Sarantopoulos: ml-services: fix discrepancies caused by shoddy c&p in 1013317 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 (owner: 10Klausman) [14:18:09] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1678/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [14:18:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: adding cloudbackup2004 to codfw - jhancock@cumin2002" [14:19:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudbackup2004 to codfw - jhancock@cumin2002" [14:19:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:20:07] (03CR) 10JMeybohm: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol) [14:20:15] (03PS25) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [14:20:23] (03CR) 10Brouberol: external-services: define a chart referencing external services clusters (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:21:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudbackup2004.mgmt.codfw.wmnet with reboot policy FORCED [14:21:12] (03PS4) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) [14:21:20] (03CR) 10Brouberol: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol) [14:21:36] (03CR) 10CI reject: [V:04-1] external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:24:32] (03PS1) 10Muehlenhoff: Point urldownloader in eqiad to 1004 [dns] - 10https://gerrit.wikimedia.org/r/1013322 [14:26:45] (03CR) 
10Muehlenhoff: [C:03+2] Point urldownloader in eqiad to 1004 [dns] - 10https://gerrit.wikimedia.org/r/1013322 (owner: 10Muehlenhoff) [14:27:32] (03CR) 10Muehlenhoff: [C:03+2] Remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff) [14:28:02] (03Abandoned) 10Muehlenhoff: prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013074 (owner: 10Muehlenhoff) [14:28:56] (03CR) 10Filippo Giunchedi: [C:03+2] "Good questions! If the impact is significant and the trade-offs acceptable (e.g. the dashboards like you mentioned) then ideally I'd like " [puppet] - 10https://gerrit.wikimedia.org/r/1012995 (https://phabricator.wikimedia.org/T359633) (owner: 10Filippo Giunchedi) [14:31:22] (03CR) 10Jelto: [C:03+1] "lgtm, according to debmonitor there is no host on bookworm using the exporter currently: https://debmonitor.wikimedia.org/packages/prometh" [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [14:32:12] (03CR) 10Jelto: [V:03+1 C:03+2] etherpad: install mariadb server in wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003769 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [14:33:21] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 (owner: 10EoghanGaffney) [14:33:35] (03PS2) 10Clément Goubert: envoy: Add missing service mesh listeners [puppet] - 10https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625) [14:33:38] (03PS1) 10Muehlenhoff: aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522) [14:34:23] (03PS4) 10Brouberol: Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) [14:35:20] (03CR) 10Brouberol: Add template rendering external services 
egress NetworkPolicy resources (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:35:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T356166)', diff saved to https://phabricator.wikimedia.org/P58871 and previous config saved to /var/cache/conftool/dbconfig/20240321-143528-marostegui.json [14:35:33] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [14:35:41] (03PS5) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) [14:35:48] (03CR) 10CI reject: [V:04-1] Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:36:29] !log installing glibc security updates on bullseye [14:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:55] (03PS6) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) [14:37:17] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:22] (03CR) 10CI reject: [V:04-1] aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff) [14:39:10] (03PS7) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) [14:39:39] (03PS26) 10Brouberol: 
external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [14:40:00] (03PS5) 10Brouberol: Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) [14:41:20] brouberol: re: aqs probes you'll have to change the service status in hieradata/common/service.yaml or set page: false in the service stanza btw [14:41:53] moritzm: could you have a look please? I'm about to go get my kid from daycare. Thank you! [14:41:59] assuming that's the intended idea, i.e. aqs.discovery.wmnet backends no longer being a thing [14:42:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:42:23] (03CR) 10Muehlenhoff: [C:03+1] "The idp_nodes value is eventually passed down to the memcached config, so that all IDPs update the same memcached backends. 
Given they are" [puppet] - 10https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [14:42:32] yes, we were asked to decom AQS, so this is very much no longer a thing [14:42:54] (03PS1) 10Majavah: ldap: Pass typed data to sssd class [puppet] - 10https://gerrit.wikimedia.org/r/1013324 [14:43:04] brouberol: ack, I'll do the page: false setting thing [14:43:08] (03PS3) 10Clément Goubert: envoy: Add missing service mesh listeners [puppet] - 10https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625) [14:43:46] I can take care of the CR, but I need to get going right after, meaning I won't be able to deploy it for a bit [14:43:56] I didn't mean to impose :/ [14:43:58] 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, and 2 others: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649779 (10Clement_Goubert) [14:44:40] brouberol: no worries at all! easy enough and I'm in the middle of deploying another prometheus change which means puppet is stopped anyways [14:44:52] appreciated, thank you! 
[14:45:04] godog, brouberol: let's set page=false as an interim and then we can yank the entire service definition as a followup [14:45:15] brouberol: sure np [14:45:18] (03PS2) 10Majavah: ldap: Pass typed data to sssd class [puppet] - 10https://gerrit.wikimedia.org/r/1013324 [14:45:18] moritzm: ack, will do [14:46:05] (03PS1) 10Filippo Giunchedi: hieradata: set aqs to non-paging [puppet] - 10https://gerrit.wikimedia.org/r/1013325 (https://phabricator.wikimedia.org/T360522) [14:46:08] ^ [14:46:29] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1681/console" [puppet] - 10https://gerrit.wikimedia.org/r/1013324 (owner: 10Majavah) [14:47:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1013325 (https://phabricator.wikimedia.org/T360522) (owner: 10Filippo Giunchedi) [14:47:25] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: set aqs to non-paging [puppet] - 10https://gerrit.wikimedia.org/r/1013325 (https://phabricator.wikimedia.org/T360522) (owner: 10Filippo Giunchedi) [14:47:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 17 hosts with reason: Schema change T356166 [14:48:03] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [14:48:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 17 hosts with reason: Schema change T356166 [14:48:31] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641 (10MPhamWMF) 03NEW [14:48:40] (03CR) 10CI reject: [V:04-1] ldap: Pass typed data to sssd class [puppet] - 10https://gerrit.wikimedia.org/r/1013324 (owner: 10Majavah) [14:50:01] (03CR) 10Majavah: "This was still in use in profile::toolforge::prometheus?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff) [14:50:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P58872 and previous config saved to /var/cache/conftool/dbconfig/20240321-145036-marostegui.json [14:51:55] (03PS27) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [14:51:57] (03CR) 10Muehlenhoff: [C:03+2] "How so? Per PCC it's unused, see the earlier PCC output." [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff) [14:52:13] (03CR) 10Brouberol: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:52:58] (03PS9) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [14:53:32] (03PS1) 10Muehlenhoff: Revert "Remove unused profile" [puppet] - 10https://gerrit.wikimedia.org/r/1013326 [14:54:33] (03CR) 10Majavah: "I think the PCC sync might be broken due to our recent Puppet 7 migration (and Andrew is working on fixing it), but the profile is very mu" [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff) [14:55:53] (03PS1) 10Majavah: P:toolforge::prometheus: Fix blackbox exporter installation [puppet] - 10https://gerrit.wikimedia.org/r/1013327 [14:56:11] (03CR) 10Majavah: "Or let's do https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013327 instead?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1013326 (owner: 10Muehlenhoff) [14:56:57] (03CR) 10CI reject: [V:04-1] global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [14:57:17] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:57:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1013327 (owner: 10Majavah) [14:57:37] (03CR) 10Muehlenhoff: "Sure, that works. +1d" [puppet] - 10https://gerrit.wikimedia.org/r/1013326 (owner: 10Muehlenhoff) [14:57:43] (03CR) 10Dzahn: "prometheus::blackbox::check::http which is used all over the place says:" [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff) [14:58:38] (03CR) 10Ssingh: [C:03+1] "Key verified out of band." 
[puppet] - 10https://gerrit.wikimedia.org/r/1013139 (owner: 10CDobbins) [14:58:40] (03CR) 10Ssingh: [C:03+2] admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1013139 (owner: 10CDobbins) [14:59:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:00:04] (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Fix blackbox exporter installation [puppet] - 10https://gerrit.wikimedia.org/r/1013327 (owner: 10Majavah) [15:02:00] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9649869 (10MMiller_WMF) I am Mike's manager and I approve this request! [15:03:48] (03CR) 10Muehlenhoff: "I'm still reverting the change, though since apparently this is also used by the Ci..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1013326 (owner: 10Muehlenhoff) [15:03:56] (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove unused profile" [puppet] - 10https://gerrit.wikimedia.org/r/1013326 (owner: 10Muehlenhoff) [15:04:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudbackup2004.mgmt.codfw.wmnet with reboot policy FORCED [15:05:42] (03PS2) 10Muehlenhoff: aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522) [15:05:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P58873 and previous config saved to /var/cache/conftool/dbconfig/20240321-150544-marostegui.json [15:06:08] (03PS2) 10Klausman: ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 [15:06:13] (03PS10) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [15:06:45] (03CR) 10Klausman: ml-services: fix discrepancies caused by shoddy c&p in 1013317 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 (owner: 10Klausman) [15:06:59] (03CR) 10Klausman: ml-services: fix discrepancies caused by shoddy c&p in 1013317 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 (owner: 10Klausman) [15:09:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 36.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:11:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.13% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:12:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudbackup2004.mgmt.codfw.wmnet with reboot policy FORCED [15:14:18] (03CR) 10BryanDavis: [C:04-2] "Let's sit on this idea for a bit while we wait to see if a strong hard fork of Redis shows up following https://redis.com/blog/redis-adopt" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1012797 (https://phabricator.wikimedia.org/T360378) (owner: 10BryanDavis) [15:14:42] (03PS1) 10Cparle: MachineVision is being sunsetted, so remove job [puppet] - 10https://gerrit.wikimedia.org/r/1013329 (https://phabricator.wikimedia.org/T352884) [15:16:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:16:42] (03CR) 10JMeybohm: external-services: define a chart referencing external services clusters (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [15:16:55] (03Restored) 10Muehlenhoff: prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013074 (owner: 10Muehlenhoff) [15:18:58] (ProbeDown) resolved: (2) Service aqs:7232 has failed probes (http_aqs_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aqs:7232 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[15:20:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9649919 (10Papaul) dbprov2005 re-image is stocked at puppet run. When i login to the server and try to manually run puppet i get the error below. ` Error: The CRL issued by 'CN=... [15:20:35] (03CR) 10JMeybohm: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol) [15:20:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T356166)', diff saved to https://phabricator.wikimedia.org/P58874 and previous config saved to /var/cache/conftool/dbconfig/20240321-152051-marostegui.json [15:20:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:20:57] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [15:20:59] 06SRE, 10ChangeProp, 06Commons, 10GitLab, and 8 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9649904 (10brennen) For GitLab: I //think// we currently run the bundled Redis in their Omnibus package. In that case, the easiest thing... 
[15:21:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:21:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [15:21:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [15:21:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T356166)', diff saved to https://phabricator.wikimedia.org/P58875 and previous config saved to /var/cache/conftool/dbconfig/20240321-152134-marostegui.json [15:21:41] (03CR) 10Jcrespo: "Thank you, you now understood what I meant. Looking good, no blockers on my side to deploy." [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb) [15:22:22] (03PS28) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [15:22:31] (03CR) 10Brouberol: external-services: define a chart referencing external services clusters (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [15:22:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudbackup2004.mgmt.codfw.wmnet with reboot policy FORCED [15:23:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:24:44] 06SRE, 06serviceops: VRT wiki fails to create account - 
https://phabricator.wikimedia.org/T359901#9649944 (10thcipriani) [15:25:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9649946 (10Jhancock.wm) Found the drive as absent in iDRAC. Physically, the drive is there but is not blinking like the other drives.... [15:25:23] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [15:26:09] (03CR) 10Clément Goubert: [C:03+1] Add new ceph container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:26:25] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2004'] [15:27:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudbackup2004'] [15:27:41] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2004'] [15:28:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:30:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:31:08] (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Lock backups on the 
destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 (owner: 10EoghanGaffney) [15:32:30] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dbprov2005.codfw.wmnet with OS bullseye [15:32:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye executed with errors: - dbprov20... [15:33:21] (03PS8) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) [15:33:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [15:33:52] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [15:33:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudbackup2004'] [15:34:01] (03CR) 10Brouberol: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol) [15:34:58] (03Merged) 10jenkins-bot: [gitlab] Lock backups on the destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 (owner: 10EoghanGaffney) [15:35:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.7% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:37:40] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650056 (10jcrespo) That is new and doesn't happen on the old hosts, but not a big blocker. However, the OS was installed on the SSDs, not on the HDs- that is much more unfixabl... [15:38:33] (03CR) 10Brouberol: [C:03+1] aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff) [15:41:29] (03PS1) 10Filippo Giunchedi: Revert "prometheus: scrape envoy on k8s metrics with 'usedonly'" [puppet] - 10https://gerrit.wikimedia.org/r/1013254 [15:41:31] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9650076 (10Jhancock.wm) [15:41:45] (03CR) 10Clément Goubert: [C:03+1] Revert "prometheus: scrape envoy on k8s metrics with 'usedonly'" [puppet] - 10https://gerrit.wikimedia.org/r/1013254 (owner: 10Filippo Giunchedi) [15:41:54] (03CR) 10Aklapper: [C:03+2] "+2 self-approving" [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/984213 (https://phabricator.wikimedia.org/T338611) (owner: 10Aklapper) [15:42:21] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] Revert "prometheus: scrape envoy on k8s metrics with 'usedonly'" [puppet] - 10https://gerrit.wikimedia.org/r/1013254 (owner: 10Filippo Giunchedi) [15:42:36] (03CR) 10Aklapper: [V:03+2 C:03+2] AVA: Remove unused variable; take age into account [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/984213 (https://phabricator.wikimedia.org/T338611) (owner: 10Aklapper) [15:50:41] !log cgoubert@deploy1002:~$ sudo chown imagecatalog:imagecatalog 
/srv/deployment/imagecatalog/catalog.sqlite [15:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:19] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:53:20] (03PS1) 10Muehlenhoff: Configure dbprov2005/2006 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013332 (https://phabricator.wikimedia.org/T355355) [15:53:25] (SystemdUnitFailed) resolved: imagecatalog_record.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:48] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9650092 (10ovasileva) >>! In T355914#9578211, @Jdlrobson wrote: > Providing engineering perspective on behalf of the WMF web team, I agree that if we wan... [15:54:49] (03CR) 10Ssingh: [C:03+2] cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:54:59] (03CR) 10Ssingh: [C:03+2] "Thanks for the review volans!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:55:01] (03CR) 10Ssingh: [V:03+2 C:03+2] cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:56:01] (03CR) 10Muehlenhoff: [C:03+2] Configure dbprov2005/2006 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013332 (https://phabricator.wikimedia.org/T355355) (owner: 10Muehlenhoff) [16:00:04] jhathaway and rzl: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:01:15] (03CR) 10Dzahn: [C:03+2] requesttracker: switch SSL cert provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013145 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [16:01:28] (03PS2) 10Dzahn: requesttracker: switch SSL cert provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013145 (https://phabricator.wikimedia.org/T360413) [16:04:15] (03CR) 10Dzahn: [C:03+2] Delete peopleweb dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1013227 (https://phabricator.wikimedia.org/T360413) (owner: 10Muehlenhoff) [16:04:17] (03CR) 10Dzahn: [V:03+2 C:03+2] Delete peopleweb dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1013227 (https://phabricator.wikimedia.org/T360413) (owner: 10Muehlenhoff) [16:04:59] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox [16:05:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:06:07] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9650260 (10cmooney) >>! In T326322#9130092, @ayounsi wrote: > @cmooney I came across https://www.juniper.net/documentation/us/en/softwar... 
[16:06:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T356166)', diff saved to https://phabricator.wikimedia.org/P58878 and previous config saved to /var/cache/conftool/dbconfig/20240321-160653-marostegui.json [16:06:57] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [16:07:03] !log disabling read-repair (Cassandra) for restbase tables — T360548 [16:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:07] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548 [16:08:38] (03PS1) 10Elukey: Add the amd-pytorch base image for ML workloads [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) [16:10:15] (03CR) 10Dzahn: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1013145 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [16:12:55] (03PS56) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [16:13:05] (03CR) 10Elukey: "Hi folks! This is the first version of the Pytorch's base image. The total size is 12.4GB (!! 
sigh), but I am able to run a python3 interp" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [16:13:40] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: 14decommission db2096 - 14https://phabricator.wikimedia.org/T360554#9650362 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:14:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:14:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:14:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye [16:15:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:17:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [16:17:39] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.dns.roll-reboot (exit_code=97) rolling reboot on A:dnsbox [16:17:50] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [16:18:30] (ProbeDown) firing: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:25] 
(03PS1) 10Ssingh: sre.dns.roll-reboot: fix typo in depool_sleep [cookbooks] - 10https://gerrit.wikimedia.org/r/1013336 [16:20:50] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/WMF for zoe - https://phabricator.wikimedia.org/T360639#9650405 (10VPuffetMichel) Hi there, Zoe is the new member of the editing team. Let me know if you need anything from me. [16:21:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.68% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:22:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P58879 and previous config saved to /var/cache/conftool/dbconfig/20240321-162200-marostegui.json [16:22:41] (SystemdUnitFailed) firing: (2) wmf_auto_restart_nginx.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:22:42] (03CR) 10Ahmon Dancy: [C:03+1] gitlab: temporary allow dockerfile frontend on Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/1013049 (https://phabricator.wikimedia.org/T357612) (owner: 10Jelto) [16:24:29] (03CR) 10Ssingh: [C:03+2] sre.dns.roll-reboot: fix typo in depool_sleep [cookbooks] - 10https://gerrit.wikimedia.org/r/1013336 (owner: 10Ssingh) [16:25:24] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=registry1003.eqiad.wmnet [16:25:58] !log expand vram for registry100[3,4] from 4G to 6G - T360637 [16:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:01] T360637: Bump memory for registry[12]00[34] VMs - 
https://phabricator.wikimedia.org/T360637 [16:26:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:27:53] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM registry1003.eqiad.wmnet [16:30:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:33:35] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye [16:34:24] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dns-rec and not P{dns1004*} and A:dnsbox [16:34:27] (03CR) 10JMeybohm: [C:03+1] admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol) [16:35:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:35:44] (03CR) 10JMeybohm: [C:03+1] external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [16:35:52] !log edit 
/etc/network/interfaces on registry1003 (ens5 => ens13) - T360637 [16:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:56] T360637: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637 [16:36:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [16:36:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650500 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [16:36:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov2005.codfw.wmnet with OS bullseye [16:36:34] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650501 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye executed w... 
[16:37:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P58880 and previous config saved to /var/cache/conftool/dbconfig/20240321-163708-marostegui.json [16:37:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:37:17] (JobUnavailable) firing: Reduced availability for job docker-registry in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:38:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1003.eqiad.wmnet [16:38:45] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=registry1003.eqiad.wmnet [16:38:58] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=registry1004.eqiad.wmnet [16:39:02] (03CR) 10JMeybohm: "Quick question right away: Does it make sense to start with a "version including" naming scheme right away? 
Will there be pytorch2.2 and p" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [16:39:04] (03CR) 10MVernon: [V:03+2 C:03+2] Add new ceph container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:39:39] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM registry1004.eqiad.wmnet [16:40:00] (03PS1) 10EoghanGaffney: [gitlab] Switch gitlab-replica from gitlab1004 to gitlab1003 [puppet] - 10https://gerrit.wikimedia.org/r/1013339 (https://phabricator.wikimedia.org/T358559) [16:40:25] (SystemdUnitFailed) firing: build-homepage.service on registry1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:26] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 (owner: 10Klausman) [16:41:57] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:42:17] (03CR) 10Elukey: "This is a good point, I had in my mind the idea that only one pytorch version will be canonical in the 
future, but for sure we'll end up h" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [16:42:17] (JobUnavailable) resolved: Reduced availability for job docker-registry in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:42:23] !incidents [16:42:23] 4532 (UNACKED) ProbeDown sre (10.2.2.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 eqiad) [16:42:23] 4531 (RESOLVED) [2x] ProbeDown sre (ip4 aqs:7232 probes/service http_aqs_ip4) [16:42:23] 4530 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [16:42:24] 4529 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [16:42:24] 4528 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [16:42:43] !ack 4532 [16:42:43] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:43] 4532 (ACKED) ProbeDown sre (10.2.2.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 eqiad) [16:42:56] o/, herron known? [16:42:58] herron: oooff sorry [16:43:01] is that expected/related? not known to me [16:43:21] elukey: ha no worries, you are working on it? 
[16:43:22] I am bumping the ram on the eqiad registry vms, but one is up (I am working on the other one [16:43:28] not sure why it paged [16:43:41] ack ok, thank you [16:44:28] !log edit /etc/network/interfaces on registry1004 (ens5 => ens13) - T360637 [16:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:32] T360637: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637 [16:46:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1004.eqiad.wmnet [16:46:39] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=registry1004.eqiad.wmnet [16:46:57] (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:09] jouncebot: nowandnext [16:47:09] For the next 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1600) [16:47:09] In 0 hour(s) and 12 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1700) [16:47:09] In 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1700) [16:47:12] this makes zero sense to me herron [16:47:25] did the interface name change on the reboot? 
[16:47:26] (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:36] I was working on registry1004 (depooled) and registry1003 was pooled and up [16:47:43] yes yes it changed, I had to fix it etc.. [16:47:47] but the other host was up [16:48:44] registry1003 has an uptime of 19min, is that expected? [16:48:59] if it's like other blackbox checks applied to the role it checks the same virtual host on all backends [16:49:02] yes yes I worked on that as well, before 1004 [16:49:07] ah [16:50:44] (03CR) 10Jdlrobson: [C:04-1] "Not necessary to change these files - they are just static snapshots." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012989 (https://phabricator.wikimedia.org/T359983) (owner: 10Mabualruz) [16:51:28] mutante: o/ but sre.ganeti.reboot-vm does the downtime etc.., so in theory the host on which I was working on shouldn't have alarmed [16:52:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T356166)', diff saved to https://phabricator.wikimedia.org/P58881 and previous config saved to /var/cache/conftool/dbconfig/20240321-165215-marostegui.json [16:52:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [16:52:23] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [16:52:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [16:52:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T356166)', diff saved to https://phabricator.wikimedia.org/P58882 and previous config saved to 
/var/cache/conftool/dbconfig/20240321-165240-marostegui.json [16:52:51] elukey: hmm.. it's not unheard of that we had "failed to set downtime" in cookbook [16:52:54] are the blackbox probes per-host, or do they go against the service address? [16:52:54] (03PS2) 10Elukey: Add the amd-pytorch base image for ML workloads [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) [16:53:30] cdanis: no idea, I thought the service but I didn't check before the maintenance, my bad [16:53:40] I was pretty sure service address as well [16:53:44] but I don't actually know [16:54:22] in theory what alarmed was the http_docker-registry_ip4 [16:55:17] I am wondering if for some reason registry1003 was not completely up when I worked on 1004, service wise [16:56:36] nope access logs are good for 1003 [16:56:38] (03CR) 10JMeybohm: global_config: rework external services data structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [16:59:40] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [16:59:54] this should answer the question if it's on each backend or not, looks like not: [17:00:00] https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*registry.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D [17:00:05] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1700) [17:00:12] ^ all probe results matching *registry* [17:00:39] thanks it matches with what I found as well [17:00:46] (03PS11) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [17:00:55] at this point the only thing that I can think of is that I was too quick in moving to the other node [17:01:22] (03PS1) 10Fabfur: benthos/haproxy: delete some fields that aren't in curr webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1013341 (https://phabricator.wikimedia.org/T360642) [17:01:28] (03PS12) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [17:01:57] sorry for the noise folks! [17:02:50] (03CR) 10CI reject: [V:04-1] global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [17:03:09] probably just unlucky timing, ack [17:03:27] (03PS1) 10Jdlrobson: Support legacy message box styles markup in JavaScript [skins/Vector] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633) [17:04:06] * elukey afk o/ [17:04:24] (03PS13) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [17:06:01] !log restarting decommissions (restbase1024-{b,c}) — T360548 [17:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:18] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548 [17:07:31] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1683/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [17:08:30] (03CR) 10Brouberol: [V:03+1] global_config: rework external services data structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [17:09:33] (03PS1) 10Dzahn: delete rt.discovery.wmnet certificate, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013345 (https://phabricator.wikimedia.org/T360413) [17:11:01] (03PS1) 10Reedy: GenerateFancyCaptchas: Include stderr result if captcha.py returns an error code [extensions/ConfirmEdit] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013257 (https://phabricator.wikimedia.org/T360653) [17:11:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:13:59] (03PS1) 10Dzahn: delete rt.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1013367 (https://phabricator.wikimedia.org/T360413) [17:14:12] (03PS2) 10Dzahn: delete rt.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1013367 (https://phabricator.wikimedia.org/T360413) [17:14:36] (03CR) 10Dzahn: [V:03+2 C:03+2] delete rt.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1013367 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:15:16] (03CR) 10Dzahn: [C:03+2] delete rt.discovery.wmnet certificate, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013345 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:15:40] 06SRE, 
06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9650773 (10Dzahn) [17:16:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 36.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:17:46] (03PS2) 10Dzahn: planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) [17:18:53] (03CR) 10CI reject: [V:04-1] planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:19:40] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:20:21] (03PS1) 10Cparle: MachineVision extension is being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) [17:22:56] (03CR) 10CI reject: [V:04-1] Support legacy message box styles markup in JavaScript [skins/Vector] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633) (owner: 10Jdlrobson) [17:23:04] (03CR) 10Reedy: [C:03+2] GenerateFancyCaptchas: Include stderr result if captcha.py returns an error code [extensions/ConfirmEdit] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013257 (https://phabricator.wikimedia.org/T360653) (owner: 10Reedy) [17:24:25] (03PS14) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [17:24:33] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1012765 (owner: 10Andrew Bogott) [17:27:38] (03PS2) 10Cparle: MachineVision extension is being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) [17:28:13] (03PS3) 10Cparle: MachineVision extension is being sunsetted, so stop doing dumps [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) [17:30:55] 06SRE, 06Traffic, 13Patch-For-Review: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9650886 (10cmooney) [17:35:25] (SystemdUnitFailed) resolved: build-homepage.service on registry1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:39:13] (03Merged) 10jenkins-bot: GenerateFancyCaptchas: Include stderr result if captcha.py returns an error code [extensions/ConfirmEdit] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013257 (https://phabricator.wikimedia.org/T360653) (owner: 10Reedy) [18:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1800) [18:00:17] !log reedy@deploy1002 Synchronized php-1.42.0-wmf.23/extensions/ConfirmEdit/maintenance/GenerateFancyCaptchas.php: T360653 (duration: 16m 00s) [18:00:25] oooh perfect timing [18:00:29] T360653: GenerateFancyCaptchas doesn't output errors relating to running captcha.py - https://phabricator.wikimedia.org/T360653 [18:00:29] nice work Reedy [18:00:33] :D [18:00:58] All clear? [18:01:11] yup :) [18:01:32] Alright. 
Pressing the button [18:01:44] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013378 (https://phabricator.wikimedia.org/T354441) [18:01:46] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013378 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [18:02:30] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013378 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [18:06:08] (03PS1) 10Andrew Bogott: base: remove profile::base::manage_timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/1013382 [18:12:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 31.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:13:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 938.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:13:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [18:14:06] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9651230 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [18:16:24] !log dancy@deploy1002 rebuilt and synchronized 
wikiversions files: group2 wikis to 1.42.0-wmf.23 refs T354441 [18:16:28] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441 [18:17:41] (03CR) 10Dreamy Jazz: "I'd prefer that tests exist for the script before we run it automatically on all wikis, but not a deal breaker to me." [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: 10Tchanders) [18:18:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 918.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:21:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T356166)', diff saved to https://phabricator.wikimedia.org/P58884 and previous config saved to /var/cache/conftool/dbconfig/20240321-182117-marostegui.json [18:21:22] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:22:39] (03CR) 10Dzahn: [C:04-1] "ERROR: Failed to parse hieradata/role/common/planet.yaml: (hieradata/role/common/planet.yaml): did not find expected alphabetic or numeric" [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:25:43] (03PS1) 10Ahmon Dancy: logstash_checker.py: Fix error reporting bug [puppet] - 10https://gerrit.wikimedia.org/r/1013385 [18:26:16] (03PS3) 10Dzahn: planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) [18:27:32] (03CR) 10CI reject: [V:04-1] planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 
10Dzahn) [18:27:49] (03CR) 10Dzahn: [C:03+2] logstash_checker.py: Fix error reporting bug [puppet] - 10https://gerrit.wikimedia.org/r/1013385 (owner: 10Ahmon Dancy) [18:30:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye [18:32:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 36.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:33:23] (03CR) 10Krinkle: mediawiki.yaml: Use static.php to serve www.mediawiki.org/ontology/ontology.owl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [18:34:21] (03PS4) 10Dzahn: planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) [18:34:54] (03PS1) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [18:35:34] (03PS1) 10Dzahn: delete planet.discovery.wmnet certificate, switched to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013387 (https://phabricator.wikimedia.org/T360413) [18:36:03] (03PS1) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [18:36:13] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9651386 (10Papaul) @MoritzMuehlenhoff i tried again the re-image once the server reboots after the OS install the cookbook failed with error below. 
` Excep... [18:36:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P58886 and previous config saved to /var/cache/conftool/dbconfig/20240321-183625-marostegui.json [18:36:37] (03PS1) 10Dzahn: delete planet.discovery.wmnet key [labs/private] - 10https://gerrit.wikimedia.org/r/1013388 (https://phabricator.wikimedia.org/T360413) [18:39:01] (03PS2) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [18:40:11] (03PS3) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [18:41:15] (03CR) 10Dreamy Jazz: "We may want to wait until we have a date for deployment and wait to merge this until deployment is not far away to avoid the API and Speci" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz) [18:42:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:46:24] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013382 (owner: 10Andrew Bogott) [18:47:16] (03PS15) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [18:47:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1012765 (owner: 10Andrew Bogott) [18:51:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P58887 and previous 
config saved to /var/cache/conftool/dbconfig/20240321-185132-marostegui.json [18:51:59] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:54:00] (03PS3) 10Ahmon Dancy: mediawiki.yaml: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl [puppet] - 10https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807) [18:54:00] (03PS1) 10Ahmon Dancy: Route /w/docs/ to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/1013389 (https://phabricator.wikimedia.org/T171807) [18:54:40] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:54:47] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:54:59] !log removing IPv6 VRRP config on codfw core routers for vlan 2018 private1-b-codfw T351534 [18:55:01] (03CR) 10Dzahn: [C:03+2] planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:03] T351534: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534 [18:56:40] (03CR) 10Ahmon Dancy: mediawiki.yaml: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [18:58:08] (03CR) 10Ahmon Dancy: "Analogous to the recent changes made in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012439" [puppet] - 10https://gerrit.wikimedia.org/r/1013389 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [18:59:51] (03CR) 10Krinkle: [C:03+1] mediawiki.yaml: 
Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl [puppet] - 10https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [18:59:54] (03CR) 10Krinkle: [C:03+1] Route /w/docs/ to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/1013389 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [19:00:13] (03CR) 10Dzahn: [C:03+2] "SAN field looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [19:03:55] (03PS2) 10Jdlrobson: Support legacy message box styles markup in JavaScript [skins/Vector] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633) [19:05:07] (03CR) 10Krinkle: Use more compact PHP7 syntax where possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [19:06:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T356166)', diff saved to https://phabricator.wikimedia.org/P58888 and previous config saved to /var/cache/conftool/dbconfig/20240321-190640-marostegui.json [19:06:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:06:53] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [19:06:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:07:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1246.eqiad.wmnet with reason: Maintenance [19:07:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1246.eqiad.wmnet with reason: Maintenance [19:07:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T356166)', diff saved to 
https://phabricator.wikimedia.org/P58889 and previous config saved to /var/cache/conftool/dbconfig/20240321-190723-marostegui.json [19:08:17] (03PS2) 10Dzahn: delete planet.discovery.wmnet key [labs/private] - 10https://gerrit.wikimedia.org/r/1013388 (https://phabricator.wikimedia.org/T360413) [19:09:01] (03CR) 10Dzahn: [C:03+2] delete planet.discovery.wmnet certificate, switched to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013387 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [19:09:36] (03CR) 10Dzahn: [V:03+2 C:03+2] delete planet.discovery.wmnet key [labs/private] - 10https://gerrit.wikimedia.org/r/1013388 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [19:09:56] !log adding routes to codfw row b hosts towards spine switch IPs on private1-b-codfw T351534 [19:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:00] T351534: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534 [19:10:28] (03CR) 10Dzahn: ssl: delete peopleweb cert, replaced by cfssl provided cert [puppet] - 10https://gerrit.wikimedia.org/r/1013128 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [19:11:07] (03CR) 10Dzahn: [V:03+2 C:03+2] ssl: delete peopleweb cert, replaced by cfssl provided cert [puppet] - 10https://gerrit.wikimedia.org/r/1013128 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [19:15:14] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9651556 (10Dzahn) [19:16:59] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:17:02] (03PS1) 10Dzahn: delete etherpad.discovery ssl key, migrated to cfssl [puppet] - 
10https://gerrit.wikimedia.org/r/1013393 (https://phabricator.wikimedia.org/T360413) [19:17:44] !log remove VRRP GW IP for vlan 2018 from codfw core routers and add to EVPN switches irb.2018 interface T351534 [19:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:48] T351534: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534 [19:20:19] (03PS1) 10Dzahn: delete etherpad.discovery.wmnet dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1013394 (https://phabricator.wikimedia.org/T360413) [19:20:20] (03PS2) 10Dzahn: delete etherpad.discovery ssl cert, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013393 (https://phabricator.wikimedia.org/T360413) [19:22:35] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[28,32,34-36].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002 [19:22:40] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548 [19:36:31] 06SRE, 10Wikimedia-Mailing-lists: 14Create a mailing list for plwiki arbcom - 14https://phabricator.wikimedia.org/T360682#9651619 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup 14{{done}} https://lists.wikimedia.org/postorius/lists/wikipedia-pl-arbcom.lists.wikimedia.org/ Please let me know if you h... 
[19:37:56] (03PS1) 10Bking: elastic: Bring elastic2107/2108 into service [puppet] - 10https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) [19:39:26] (03CR) 10CI reject: [V:04-1] elastic: Bring elastic2107/2108 into service [puppet] - 10https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [19:39:34] (03PS2) 10Ryan Kemper: elastic: Bring elastic2107/2108 into service [puppet] - 10https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [19:40:45] (03CR) 10CI reject: [V:04-1] elastic: Bring elastic2107/2108 into service [puppet] - 10https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [19:41:38] (03PS3) 10Bking: elastic: Bring elastic2107/2108 into service [puppet] - 10https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) [19:41:49] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [19:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [19:52:02] (03CR) 10Ryan Kemper: [C:03+1] elastic: Bring elastic2107/2108 into service [puppet] - 10https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [19:52:10] (03CR) 10Bking: [C:03+2] elastic: Bring elastic2107/2108 into service [puppet] - 10https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [19:59:29] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T2000) [20:00:05] jan_drewniak and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:38] o/ [20:02:16] hi jan_drewniak - do you want to self-deploy? [20:02:38] i'm happy to do both of ours if you prefer [20:04:10] Hi cjming ! if you could do both that'd be great (I think you can do two at once with scap backport 1013255 1009718) [20:04:35] alrighty - i'll start in [20:06:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633) (owner: 10Jdlrobson) [20:10:48] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-b-codfw entries - cmooney@cumin1002" [20:11:04] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[28,32,34-36].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? 
— T360548 - eevans@cumin1002 [20:11:08] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548 [20:11:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-b-codfw entries - cmooney@cumin1002" [20:11:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:11:44] (03PS1) 10Bking: elastic-codfw: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) [20:12:22] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:12:59] (03CR) 10CI reject: [V:04-1] elastic-codfw: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:13:49] (03PS2) 10Bking: elastic-codfw: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) [20:14:15] !log deleting irb.2018 interfaces from codfw spine switches T351534 [20:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:20] T351534: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534 [20:15:44] (03CR) 10Ryan Kemper: [C:03+1] elastic-codfw: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:16:01] (03CR) 10Bking: [C:03+2] elastic-codfw: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:16:34] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:16:41] !log @deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [20:18:45] (ProbeDown) firing: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [20:22:41] (SystemdUnitFailed) firing: (2) wmf_auto_restart_nginx.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:27] (03CR) 10Gergő Tisza: [C:03+1] Use more compact PHP7 syntax where possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [20:23:30] (ProbeDown) resolved: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:56] (03CR) 10Gergő Tisza: [C:03+1] "Scheduled for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [20:27:19] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:27:23] (03Merged) 10jenkins-bot: Support legacy message box styles markup in JavaScript [skins/Vector] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633) (owner: 10Jdlrobson) [20:27:53] 
!log cjming@deploy1002 Started scap: Backport for [[gerrit:1013255|Support legacy message box styles markup in JavaScript (T360633)]] [20:27:57] T360633: Non-codex legacy MW message box related styles are not being applied on Vector 2022 - https://phabricator.wikimedia.org/T360633 [20:27:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:28:18] 06SRE, 10ChangeProp, 06Commons, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9651946 (10Krinkle) [20:29:42] (03PS1) 10Bking: elastic: move elastic2037 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1013401 (https://phabricator.wikimedia.org/T358882) [20:34:53] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-b-codfw entries - cmooney@cumin1002" [20:35:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878 [20:35:14] T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 [20:35:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-b-codfw entries - cmooney@cumin1002" [20:35:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:35:56] 06SRE, 10ChangeProp, 10GitLab, 06Infrastructure-Foundations, and 8 others: Figure out a plan to move forward with regarding Redis 
License changes - https://phabricator.wikimedia.org/T360596#9651962 (10Peachey88) [20:37:03] (03CR) 10Ryan Kemper: [C:03+1] elastic: move elastic2037 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1013401 (https://phabricator.wikimedia.org/T358882) (owner: 10Bking) [20:37:05] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878 [20:37:10] (03CR) 10Bking: [C:03+2] elastic: move elastic2037 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1013401 (https://phabricator.wikimedia.org/T358882) (owner: 10Bking) [20:42:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T356166)', diff saved to https://phabricator.wikimedia.org/P58891 and previous config saved to /var/cache/conftool/dbconfig/20240321-204249-marostegui.json [20:42:54] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [20:43:57] !log deleting irb.2001 and irb.2002 interfaces from codfw spine switches [20:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:32] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:1013255|Support legacy message box styles markup in JavaScript (T360633)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:45:40] T360633: Non-codex legacy MW message box related styles are not being applied on Vector 2022 - https://phabricator.wikimedia.org/T360633 [20:46:05] jan_drewniak: not sure why it took so long but your patch can be tested now [20:47:39] cjming: ok looks great, good to sync [20:47:46] !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync [20:50:33] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[29,32,37-39,25-27,30,33,40-42].eqiad.wmnet: Turn it off, and then 
back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002 [20:50:37] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548 [20:57:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P58892 and previous config saved to /var/cache/conftool/dbconfig/20240321-205756-marostegui.json [20:58:32] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:58:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:59:30] dancy: Hey, it seems the train might have broken captchas [20:59:49] Awesome. Rollback needed? [21:00:01] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic20[89-99]\.codfw\.wmnet [21:00:03] Based on https://grafana.wikimedia.org/d/000000370/captcha-failure-rates?orgId=1 yeah [21:00:09] i'm still finishing up the window - can i finish one more config change? [21:00:27] >18:16 dancy@deploy1002: rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.23 refs T354441 [21:00:27] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441 [21:00:43] That 1816 seems to correlate with the bottom graph going from ~50 to 100% [21:00:47] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic210[0-9]\.codfw\.wmnet [21:01:15] And looks like it increases a bit in the previous ~24-48 hours (presumably as other parts of the train rolled) [21:01:15] cjming: I can roll back when your stuff is done. 
[21:01:29] !log adding routes to codfw row a hosts towards spine switch IPs on private1-a-codfw T351532 [21:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:41] T351532: Migrate IP gateway for public1-a-codfw to spine switches - https://phabricator.wikimedia.org/T351532 [21:02:11] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic20[89]\.codfw\.wmnet [21:02:34] dancy: thanks! just hopefully a quick config change -- the one backport seemed to take forever [21:02:52] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic209[0-9]\.codfw\.wmnet [21:03:01] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1013255|Support legacy message box styles markup in JavaScript (T360633)]] (duration: 35m 07s) [21:03:05] T360633: Non-codex legacy MW message box related styles are not being applied on Vector 2022 - https://phabricator.wikimedia.org/T360633 [21:03:05] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic2089\.codfw\.wmnet [21:03:35] (03PS4) 10Clare Ming: ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: 10Phuedx) [21:03:59] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878 [21:04:08] T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 [21:04:40] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:04:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009718 
(https://phabricator.wikimedia.org/T352342) (owner: 10Phuedx) [21:04:57] (03CR) 10TrainBranchBot: "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: 10Phuedx) [21:06:00] !log deleting VRRP GW for 10.192.0.1 / private1-a-codfw from codfw core routers and adding to leaf switches row A T351532 [21:06:01] (03Merged) 10jenkins-bot: ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: 10Phuedx) [21:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:18] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1009718|ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config (T352342)]] [21:06:24] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9652068 (10bking) We're going to upgrade curator (as well as its library)... 
[21:06:30] T352342: QA WebUIScroll port to the new metrics platform - https://phabricator.wikimedia.org/T352342
[21:08:45] !log cjming@deploy1002 cjming and phuedx: Backport for [[gerrit:1009718|ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config (T352342)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:08:50] !log cjming@deploy1002 cjming and phuedx: Continuing with sync
[21:13:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P58893 and previous config saved to /var/cache/conftool/dbconfig/20240321-211303-marostegui.json
[21:14:18] SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 8 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9652082 (Krinkle) In MediaWiki (as deployed at WMF), there exists 1 use of Redis, which is during file uploads via...
[21:20:42] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1009718|ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config (T352342)]] (duration: 14m 24s)
[21:20:47] T352342: QA WebUIScroll port to the new metrics platform - https://phabricator.wikimedia.org/T352342
[21:20:49] !log end of UTC late backport window
[21:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:59] dancy: all yours - thanks for your patience
[21:21:04] thx
[21:21:30] Reedy: is there a ticket for that issue?
[21:22:00] T360717
[21:22:00] T360717: CAPTCHA failure rate at 100% - https://phabricator.wikimedia.org/T360717
[21:22:04] thx
[21:22:15] (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013404 (https://phabricator.wikimedia.org/T354441)
[21:22:16] (CR) TrainBranchBot: [C:+2] group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013404 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[21:22:21] Amir has noticed it seems to be doing requests to codfw, which is odd
[21:22:59] (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013404 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[21:23:17] !log deleting irb.2017 interface from ssw1-a1-codfw and ssw1-a8-codfw
[21:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:40] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:27:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:28:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T356166)', diff saved to https://phabricator.wikimedia.org/P58894 and previous config saved to /var/cache/conftool/dbconfig/20240321-212811-marostegui.json
[21:28:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[21:28:15] T356166: Drop cl_collation_ext index from categorylinks in production -
https://phabricator.wikimedia.org/T356166
[21:28:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[21:29:21] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[21:32:15] lol captcha failure rate is at 50% in API
[21:32:17] guess whyyyyyy
[21:32:22] Reedy: ^
[21:32:32] ?
[21:33:01] eqiad / codfw split I think
[21:34:11] dancy: are you deploying the revert? I want to check something
[21:34:26] let me know once done
[21:34:41] rollback is in progress. I just paused it before it has done anything more than update wikiversions.json
[21:35:37] thanks!
[21:39:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye
[21:39:52] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9652188 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye
[21:41:40] ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T360722 (phaultfinder) NEW
[21:42:02] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:42:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:43:07] I have a feeling I know what's going on and I think the train rollback won't help but better to wait and make sure, once that's proven I try my thing
[21:44:05] OK. I'm going to need to step out to pick up my son during the rollback.
[21:44:39] can I do anything to move it over?
[21:44:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dns-rec and not P{dns1004*} and A:dnsbox
[21:44:54] sure! You can run `scap train`!
[21:45:12] I ended up cancelling the last run, so you could re-run and tell it that you want to be at group1 (option 3)
[21:45:27] awesome
[21:45:29] or, at this stage, just `scap sync-wikiversions` is sufficient
[21:45:36] sure
[21:46:04] Thanks!
[22:00:41] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9652267 (Volans) >>! In T345337#9652068, @bking wrote: > We're going to...
[22:02:54] SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 8 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9652274 (Ladsgroup) >>! In T360596#9652082, @Krinkle wrote: > In MediaWiki (as deployed at WMF), there exists 1 use...
[22:05:01] !log ladsgroup@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided)
[22:06:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2093-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:10:52] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9652299 (bking) > The linked task is this same one. Did you meant to li...
[22:11:39] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2093-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:15:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye
[22:18:50] (PS2) Dzahn: etherpad: switch SSL cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013146 (https://phabricator.wikimedia.org/T360413)
[22:22:31] (CR) Dzahn: [C:+2] etherpad: switch SSL cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013146 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:27:09] (CR) Dzahn: [C:+2] "before:" [puppet] - https://gerrit.wikimedia.org/r/1013146 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:31:23] (PS2) Dzahn: delete etherpad.discovery.wmnet dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013394 (https://phabricator.wikimedia.org/T360413)
[22:34:51] (CR) Dzahn: [V:+2 C:+2] delete etherpad.discovery.wmnet dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013394 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:36:56] (CR) Dzahn: [C:+2] delete etherpad.discovery ssl cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013393 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:36:57] SRE, Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9652378 (Andrew) I ran a dist-upgrade on cloudcontrol2001, 2003, 2004, 1005, 1006, 1007.
[22:39:19] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
[22:39:23] T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878
[22:39:35] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-a-codfw entries - cmooney@cumin1002"
[22:40:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-a-codfw entries - cmooney@cumin1002"
[22:40:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:41:08] !log etherpad - switching cert provider to cfssl
[22:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:36] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@9607731]: Add canary events generation dag in Airflow [airflow-dags/analytics@9607731b]
[22:42:05] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@9607731]: Add canary events generation dag in Airflow [airflow-dags/analytics@9607731b] (duration: 00m 29s)
[22:56:49] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[29,32,37-39,25-27,30,33,40-42].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)?
— T360548 - eevans@cumin1002
[22:56:53] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[23:09:21] (CR) Dzahn: releases: switch SSL cert provider to cfssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[23:10:24] (PS2) Dzahn: releases: switch SSL cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413)
[23:11:32] (PS1) Dzahn: ssl: delete releases.discovery cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013414 (https://phabricator.wikimedia.org/T360413)
[23:11:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2092-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[23:12:13] (PS1) Dzahn: ssl: delete aphlict.discovery ssl cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013415 (https://phabricator.wikimedia.org/T360413)
[23:13:09] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:13:15] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:14:56] (PS1) Dzahn: aphlict: switch envoy cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013416 (https://phabricator.wikimedia.org/T360413)
[23:15:40] (PS1) Dzahn: delete aphlict.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013417 (https://phabricator.wikimedia.org/T360413)
[23:16:11] (PS1) Dzahn: delete releases.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013418
(https://phabricator.wikimedia.org/T360413)
[23:16:39] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2092-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[23:17:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:17:50] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002
[23:17:55] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[23:18:02] SRE, collaboration-services, Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9652460 (Dzahn)
[23:19:18] (PS1) Dzahn: delete doc.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013419 (https://phabricator.wikimedia.org/T360413)
[23:20:23] (PS1) Dzahn: ssl: delete doc.discovery cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013420 (https://phabricator.wikimedia.org/T360413)
[23:22:49] (PS1) Dzahn: doc: switch envoy ssl cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013421 (https://phabricator.wikimedia.org/T360413)
[23:32:45] SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 8 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9652472 (bd808)
[23:34:52] (PS1)
TrainBranchBot: all wikis to 1.42.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013422 (https://phabricator.wikimedia.org/T354441)
[23:34:54] (CR) TrainBranchBot: [C:+2] all wikis to 1.42.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013422 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[23:35:39] (Merged) jenkins-bot: all wikis to 1.42.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013422 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[23:46:18] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:46:25] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:49:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:49:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:50:18] !log reedy@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.42.0-wmf.22 refs T354441
[23:50:22] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441
[23:52:54] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:53:01] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:54:33] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@582ad55]: Add params to canary events pipeline [airflow-dags/analytics@582ad55c]
[23:54:58] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@582ad55]: Add params to canary events pipeline [airflow-dags/analytics@582ad55c] (duration: 00m 25s)
[23:59:51] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:59:58] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply