[00:27:41] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_nginx.service on apt2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:29:17] <wikibugs>	 SRE, SRE-swift-storage, Thumbor, Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9648018 (Ladsgroup) Something to consider: {T360589}
[00:37:50] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1012662
[00:37:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1012662 (owner: TrainBranchBot)
[00:44:35] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:52:39] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:52:45] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:01:05] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1012662 (owner: TrainBranchBot)
[01:04:35] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:07:58] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9648076 (ssingh) Hi Rob: Checking if the date/time above has been confirmed by remote hands?
[01:20:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:20:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:24:22] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:24:29] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:46:30] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:46:37] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:50:17] <wikibugs>	 (PS1) Pppery: Update links to point to non-wiki privacy policy and bypass redirects [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129)
[02:00:27] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:00:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:06:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:06:40] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:12:17] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:12:24] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:16:37] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:16:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:19:57] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:20:04] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:24:04] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:24:11] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:27:10] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:27:17] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:28:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:37:17] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:42:14] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:10:58] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:11:05] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:17:17] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:33:28] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:33:36] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:38:08] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:38:15] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:03:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 830.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:08:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 837.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:27:41] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_nginx.service on apt2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:49:36] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:09:36] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:26:27] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:26:34] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:27:42] * kart_ will deploy cxserver..
[05:27:49] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2024-03-20-072017-production [deployment-charts] - https://gerrit.wikimedia.org/r/1013047 (https://phabricator.wikimedia.org/T352739) (owner: KartikMistry)
[05:28:44] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2024-03-20-072017-production [deployment-charts] - https://gerrit.wikimedia.org/r/1013047 (https://phabricator.wikimedia.org/T352739) (owner: KartikMistry)
[05:31:03] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:31:31] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:32:03] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:32:39] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:33:18] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:33:57] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:36:26] <kart_>	 !log Updated cxserver to 2024-03-20-072017-production (T352739)
[05:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:31] <stashbot>	 T352739: cxserver: Cannot read properties of undefined (reading 'pages') - https://phabricator.wikimedia.org/T352739
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T0600)
[06:00:04] <jouncebot>	 kormat, marostegui, Amir1, and arnaudb: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T0600)
[06:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:20:24] <wikibugs>	 (PS1) Marostegui: es2023: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1013162 (https://phabricator.wikimedia.org/T358746)
[06:21:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on es[2023-2025].codfw.wmnet with reason: Migrate to 10.6
[06:21:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2023-2025].codfw.wmnet with reason: Migrate to 10.6
[06:22:01] <wikibugs>	 (CR) Marostegui: [C:+2] es2023: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1013162 (https://phabricator.wikimedia.org/T358746) (owner: Marostegui)
[06:24:03] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage es2035 [puppet] - https://gerrit.wikimedia.org/r/1013163
[06:25:12] <marostegui>	 !log dbmaint deploy schema change s2 codfw T356166
[06:25:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:16] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:25:33] <wikibugs>	 (PS2) Tim Starling: SwiftTooManyMediaUploads: reduce severity [alerts] - https://gerrit.wikimedia.org/r/1010347
[06:25:38] <marostegui>	 !log dbmaint deploy schema change s1 codfw T356166
[06:25:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on 17 hosts with reason: Schema change T356166
[06:26:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 17 hosts with reason: Schema change T356166
[06:27:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on 12 hosts with reason: Schema change T356166
[06:27:59] <marostegui>	 !log dbmaint deploy schema change s3 codfw T356166
[06:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:22] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 12 hosts with reason: Schema change T356166
[06:28:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:29:17] <marostegui>	 !log dbmaint deploy schema change s1 codfw T355609
[06:29:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:22] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:29:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 17 hosts with reason: Schema change T356166
[06:30:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 17 hosts with reason: Schema change T356166
[06:30:22] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:42:14] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:43:30] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage es2035 [puppet] - https://gerrit.wikimedia.org/r/1013163 (owner: Marostegui)
[06:51:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[06:52:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[06:52:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:52:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:52:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T356166)', diff saved to https://phabricator.wikimedia.org/P58845 and previous config saved to /var/cache/conftool/dbconfig/20240321-065232-marostegui.json
[06:52:36] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:54:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T356166)', diff saved to https://phabricator.wikimedia.org/P58846 and previous config saved to /var/cache/conftool/dbconfig/20240321-065446-marostegui.json
[07:01:46] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:01:53] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:04:21] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P58847 and previous config saved to /var/cache/conftool/dbconfig/20240321-070954-marostegui.json
[07:12:01] <wikibugs>	 (PS1) Giuseppe Lavagetto: wikifeeds: scale up resources [deployment-charts] - https://gerrit.wikimedia.org/r/1013165
[07:19:31] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:19:37] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:19:40] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] wikifeeds: scale up resources [deployment-charts] - https://gerrit.wikimedia.org/r/1013165 (owner: Giuseppe Lavagetto)
[07:20:49] <wikibugs>	 (Merged) jenkins-bot: wikifeeds: scale up resources [deployment-charts] - https://gerrit.wikimedia.org/r/1013165 (owner: Giuseppe Lavagetto)
[07:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:22:48] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[07:23:06] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[07:24:49] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:24:56] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:25:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P58848 and previous config saved to /var/cache/conftool/dbconfig/20240321-072501-marostegui.json
[07:28:10] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:28:18] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:33:04] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:33:11] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:37:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:37:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:40:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T356166)', diff saved to https://phabricator.wikimedia.org/P58849 and previous config saved to /var/cache/conftool/dbconfig/20240321-074009-marostegui.json
[07:40:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[07:40:14] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[07:40:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[07:40:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T356166)', diff saved to https://phabricator.wikimedia.org/P58850 and previous config saved to /var/cache/conftool/dbconfig/20240321-074032-marostegui.json
[07:43:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:43:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] sre.puppet.renew-cert: Extend help text for --installer [cookbooks] - https://gerrit.wikimedia.org/r/1013012 (owner: Muehlenhoff)
[07:50:11] <wikibugs>	 (PS1) Slyngshede: site: Add new IDP production hosts. [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748)
[07:54:01] <wikibugs>	 (PS1) Anzx: dewiki: Enable mobile page tabs for everyone [mediawiki-config] - https://gerrit.wikimedia.org/r/1012115 (https://phabricator.wikimedia.org/T360246)
[07:55:01] <wikibugs>	 (PS5) Anzx: knwikisource, knwiktionary: update logo, wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1010881 (https://phabricator.wikimedia.org/T360022)
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T0800)
[08:00:05] <jouncebot>	 anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:07] <anzx>	 o/
[08:04:24] <wikibugs>	 (CR) Muehlenhoff: site: Add new IDP production hosts. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[08:05:51] <wikibugs>	 (PS2) Slyngshede: site: Add new IDP production hosts. [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748)
[08:05:59] <wikibugs>	 (CR) Slyngshede: site: Add new IDP production hosts. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[08:27:41] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_nginx.service on apt2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:37] <wikibugs>	 (CR) Fabfur: [C:+2] haproxy: add parameter for optional log length [puppet] - https://gerrit.wikimedia.org/r/1013114 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[08:35:57] <wikibugs>	 (CR) JMeybohm: profile::prometheus::k8s: move istio metrics to a separate job (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: Elukey)
[08:40:37] <fabfur>	 !log repooling cp4037 for about ~30m (T358109)
[08:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:43] <stashbot>	 T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109
[08:40:44] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[08:46:10] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] SwiftTooManyMediaUploads: reduce severity [alerts] - https://gerrit.wikimedia.org/r/1010347 (owner: Tim Starling)
[08:47:45] <wikibugs>	 (PS2) Jelto: gitlab: temporary allow dockerfile frontend on Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/1013049 (https://phabricator.wikimedia.org/T357612)
[08:49:50] <wikibugs>	 (CR) Jelto: gitlab: temporary allow dockerfile frontend on Trusted Runners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013049 (https://phabricator.wikimedia.org/T357612) (owner: Jelto)
[08:54:36] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:55:04] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[08:57:09] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackup::storage: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[08:57:16] <wikibugs>	 (PS3) Jcrespo: mediabackup::storage: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[08:57:31] <wikibugs>	 (CR) Slyngshede: [C:+2] site: Add new IDP production hosts. [puppet] - https://gerrit.wikimedia.org/r/1013166 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[08:58:16] <wikibugs>	 SRE, ChangeProp, GitLab, MediaWiki-File-management, Platform Team Initiatives (API Gateway): Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596 (akosiaris) NEW
[08:58:32] <wikibugs>	 (PS1) Fabfur: benthos: using URIPATH and URIPARAM for parsing corresponding fields [puppet] - https://gerrit.wikimedia.org/r/1013225 (https://phabricator.wikimedia.org/T358109)
[08:59:51] <wikibugs>	 (CR) Jcrespo: [V:+2 C:+2] mediabackup::storage: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[08:59:55] <wikibugs>	 SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 3 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648360 (akosiaris)
[09:00:57] <wikibugs>	 SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 4 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648367 (Peachey88)
[09:02:18] <wikibugs>	 SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 4 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648371 (MoritzMuehlenhoff)
[09:10:08] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[09:10:39] <wikibugs>	 (CR) Fabfur: [C:+2] benthos: using URIPATH and URIPARAM for parsing corresponding fields [puppet] - https://gerrit.wikimedia.org/r/1013225 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[09:12:13] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp2003.wikimedia.org
[09:12:14] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[09:14:23] <wikibugs>	 (PS1) Muehlenhoff: Delete peopleweb dummy cert [labs/private] - https://gerrit.wikimedia.org/r/1013227 (https://phabricator.wikimedia.org/T360413)
[09:14:36] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:15:12] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1013145 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[09:16:28] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2003.wikimedia.org - slyngshede@cumin1002"
[09:17:19] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2003.wikimedia.org - slyngshede@cumin1002"
[09:17:19] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:17:20] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp2003.wikimedia.org on all recursors
[09:17:23] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp2003.wikimedia.org on all recursors
[09:17:49] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2003.wikimedia.org - slyngshede@cumin1002"
[09:18:41] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2003.wikimedia.org - slyngshede@cumin1002"
[09:19:13] <wikibugs>	 (PS1) Fabfur: benthos: uri_query should be optional [puppet] - https://gerrit.wikimedia.org/r/1013228 (https://phabricator.wikimedia.org/T358109)
[09:22:20] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp2003.wikimedia.org with OS bookworm
[09:24:51] <wikibugs>	 (CR) Fabfur: [C:+2] benthos: uri_query should be optional [puppet] - https://gerrit.wikimedia.org/r/1013228 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[09:25:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T356166)', diff saved to https://phabricator.wikimedia.org/P58851 and previous config saved to /var/cache/conftool/dbconfig/20240321-092533-marostegui.json
[09:25:38] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[09:28:12] <wikibugs>	 (CR) Muehlenhoff: planet: switch envoy SSL provider to cfssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[09:37:56] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1013146 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[09:38:01] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp2003.wikimedia.org with reason: host reimage
[09:39:53] <wikibugs>	 (CR) Muehlenhoff: releases: switch SSL cert provider to cfssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[09:40:29] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp2003.wikimedia.org with reason: host reimage
[09:40:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P58852 and previous config saved to /var/cache/conftool/dbconfig/20240321-094041-marostegui.json
[09:42:13] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1013090 (https://phabricator.wikimedia.org/T349206) (owner: Majavah)
[09:46:22] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw
[09:55:18] <wikibugs>	 (CR) Muehlenhoff: AQS1.0: disable aqs service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[09:55:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P58853 and previous config saved to /var/cache/conftool/dbconfig/20240321-095548-marostegui.json
[09:59:13] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp2003.wikimedia.org with OS bookworm
[09:59:13] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp2003.wikimedia.org
[09:59:48] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp1003.wikimedia.org
[09:59:50] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[10:00:13] <fabfur>	 !log repooling cp4037 for about ~30m (this is the last time I'll give notice here, no need for this in the future) (T358109)
[10:00:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:24] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[10:00:24] <stashbot>	 T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109
[10:01:31] <Emperor>	 !log update ceph-reef packages to 18.2.2 on apt.wm.org
[10:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:12] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1003.wikimedia.org - slyngshede@cumin1002"
[10:03:40] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1003.wikimedia.org - slyngshede@cumin1002"
[10:03:40] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:03:41] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp1003.wikimedia.org on all recursors
[10:03:43] <wikibugs>	 (PS1) Fabfur: benthos: fix optional space in grok pattern (when no uri_query present) [puppet] - https://gerrit.wikimedia.org/r/1013232 (https://phabricator.wikimedia.org/T358109)
[10:03:44] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp1003.wikimedia.org on all recursors
[10:04:08] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1003.wikimedia.org - slyngshede@cumin1002"
[10:04:47] <wikibugs>	 (PS5) Brouberol: AQS1.0: disable aqs service [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522)
[10:05:00] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1003.wikimedia.org - slyngshede@cumin1002"
[10:05:36] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp1003.wikimedia.org with OS bookworm
[10:06:14] <wikibugs>	 (CR) Brouberol: AQS1.0: disable aqs service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[10:09:16] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[10:10:14] <wikibugs>	 (PS3) Brouberol: admin-ng: Define external services namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508)
[10:10:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T356166)', diff saved to https://phabricator.wikimedia.org/P58854 and previous config saved to /var/cache/conftool/dbconfig/20240321-101056-marostegui.json
[10:10:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[10:11:01] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[10:11:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[10:11:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T356166)', diff saved to https://phabricator.wikimedia.org/P58855 and previous config saved to /var/cache/conftool/dbconfig/20240321-101119-marostegui.json
[10:11:40] <wikibugs>	 (PS2) Fabfur: benthos: fix optional space in grok pattern (when no uri_query present) [puppet] - https://gerrit.wikimedia.org/r/1013232 (https://phabricator.wikimedia.org/T358109)
[10:13:22] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw
[10:15:59] <wikibugs>	 (PS24) Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[10:16:38] <wikibugs>	 (CR) Fabfur: [C:+2] benthos: fix optional space in grok pattern (when no uri_query present) [puppet] - https://gerrit.wikimedia.org/r/1013232 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[10:17:30] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp1003.wikimedia.org with reason: host reimage
[10:20:07] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp1003.wikimedia.org with reason: host reimage
[10:28:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1153.eqiad.wmnet with OS bookworm
[10:29:51] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad
[10:30:07] <wikibugs>	 (PS1) Phuedx: Update mediawiki.web_ui_actions stream config [mediawiki-config] - https://gerrit.wikimedia.org/r/1013234 (https://phabricator.wikimedia.org/T353029)
[10:31:40] <wikibugs>	 (PS2) Muehlenhoff: prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013074
[10:31:45] <wikibugs>	 (CR) Brouberol: [C:+2] AQS1.0: disable aqs service [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[10:31:50] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1013074 (owner: Muehlenhoff)
[10:32:21] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[10:34:05] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp1003.wikimedia.org with OS bookworm
[10:34:05] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp1003.wikimedia.org
[10:38:40] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:toolforge::k8s: haproxy: Do not start keepalived too early [puppet] - https://gerrit.wikimedia.org/r/1013090 (https://phabricator.wikimedia.org/T349206) (owner: Majavah)
[10:39:26] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] installserver: update centrallog partman [puppet] - https://gerrit.wikimedia.org/r/1013068 (https://phabricator.wikimedia.org/T359451) (owner: Filippo Giunchedi)
[10:41:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1153.eqiad.wmnet with reason: host reimage
[10:42:14] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:43:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1153.eqiad.wmnet with reason: host reimage
[10:50:01] <wikibugs>	 (PS1) Brouberol: superset-next: upgrade to 3.1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1013236 (https://phabricator.wikimedia.org/T358674)
[10:50:44] <wikibugs>	 (PS1) Slyngshede: R:idp enable new Bookworm hosts. [puppet] - https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748)
[10:50:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T356166)', diff saved to https://phabricator.wikimedia.org/P58856 and previous config saved to /var/cache/conftool/dbconfig/20240321-105052-marostegui.json
[10:50:57] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[10:51:07] <wikibugs>	 (PS1) Muehlenhoff: Also disable monitoring for AQS1 [puppet] - https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522)
[10:51:23] <wikibugs>	 (PS1) Brouberol: superset: upgrade to 3.1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1013239 (https://phabricator.wikimedia.org/T358674)
[10:52:08] <wikibugs>	 (CR) Alexandros Kosiaris: [C:-1] profile::prometheus::k8s: move istio metrics to a separate job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: Elukey)
[10:52:15] <wikibugs>	 (CR) Slyngshede: "Best way to roll out this patch is probably to disable Puppet on the existing hosts and let the two new hosts come up and verify that they" [puppet] - https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[10:53:06] <wikibugs>	 SRE, ChangeProp, Commons, GitLab, and 7 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648786 (taavi)
[10:53:23] <wikibugs>	 ops-codfw, SRE, Infrastructure-Foundations, netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9648799 (cmooney) >>! In T358244#9636601, @ayounsi wrote: > FYI it's alerting for one of its PSU being down, but we don't really care anymore : >> asw-a-codfw> show syste...
[10:53:39] <wikibugs>	 (CR) EoghanGaffney: [C:+2] [gitlab] Fix progress_bars parameter (should be print_progress_bars) [cookbooks] - https://gerrit.wikimedia.org/r/1010559 (https://phabricator.wikimedia.org/T358559) (owner: EoghanGaffney)
[10:53:47] <wikibugs>	 (CR) Muehlenhoff: "There's two separate patches we need to prepare first before this can go live:" [puppet] - https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[10:55:17] <wikibugs>	 (PS1) Majavah: Adapt clean-stale-puppet-certs for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1013240
[10:55:26] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad
[10:56:41] <wikibugs>	 (CR) CI reject: [V:-1] Adapt clean-stale-puppet-certs for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1013240 (owner: Majavah)
[10:58:05] <wikibugs>	 (Merged) jenkins-bot: [gitlab] Fix progress_bars parameter (should be print_progress_bars) [cookbooks] - https://gerrit.wikimedia.org/r/1010559 (https://phabricator.wikimedia.org/T358559) (owner: EoghanGaffney)
[10:58:16] <wikibugs>	 (PS1) Slyngshede: P:acme_chief::certificates Add new IDP hosts. [puppet] - https://gerrit.wikimedia.org/r/1013241 (https://phabricator.wikimedia.org/T357748)
[10:58:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1153.eqiad.wmnet with OS bookworm
[10:59:03] <wikibugs>	 (PS2) Majavah: Adapt clean-stale-puppet-certs for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1013240
[10:59:58] <wikibugs>	 (CR) Brouberol: Also disable monitoring for AQS1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: Muehlenhoff)
[11:00:05] <jouncebot>	 mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1100)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1100)
[11:00:23] <wikibugs>	 (PS1) Marostegui: db1153: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1013242 (https://phabricator.wikimedia.org/T353499)
[11:04:43] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9648841 (dcaro)
[11:06:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P58857 and previous config saved to /var/cache/conftool/dbconfig/20240321-110600-marostegui.json
[11:06:21] <wikibugs>	 (CR) Brouberol: Also disable monitoring for AQS1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: Muehlenhoff)
[11:06:37] <wikibugs>	 (PS1) Slyngshede: P:mariadb::ferm_misc Add new IDP hosts [puppet] - https://gerrit.wikimedia.org/r/1013245
[11:07:30] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9648853 (dcaro) New hard drives offline uncorrectable values (cloudcephosd1030) are all 0: ` root@cloudcephosd1030...
[11:08:38] <wikibugs>	 (PS2) Slyngshede: R:idp enable new Bookworm hosts. [puppet] - https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748)
[11:11:18] <wikibugs>	 (CR) Marostegui: [C:+2] db1153: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1013242 (https://phabricator.wikimedia.org/T353499) (owner: Marostegui)
[11:21:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P58860 and previous config saved to /var/cache/conftool/dbconfig/20240321-112108-marostegui.json
[11:23:24] <effie>	 jouncebot now
[11:23:24] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1100)
[11:23:24] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1100)
[11:25:06] <effie>	 mvolz: do you have anything to deploy in the next deployment window?
[11:25:57] <mvolz>	 effie: no, use it if you need to
[11:26:04] <effie>	 excellent!
[11:26:09] <effie>	 thank you!
[11:27:46] <effie>	 Dear Deployers, we will be switching over the deployment server, please refrain from using it until further notice
[11:36:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T356166)', diff saved to https://phabricator.wikimedia.org/P58862 and previous config saved to /var/cache/conftool/dbconfig/20240321-113615-marostegui.json
[11:36:18] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[11:36:31] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[11:36:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T356166)', diff saved to https://phabricator.wikimedia.org/P58863 and previous config saved to /var/cache/conftool/dbconfig/20240321-113638-marostegui.json
[11:46:00] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1200)
[12:00:52] <effie>	 !log disable puppet on deployment servers 
[12:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:26] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:15:44] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad
[12:16:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T356166)', diff saved to https://phabricator.wikimedia.org/P58865 and previous config saved to /var/cache/conftool/dbconfig/20240321-121628-marostegui.json
[12:16:32] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[12:18:25] <jinxer-wm>	 (SystemdUnitFailed) firing: imagecatalog_record.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye
[12:22:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_nginx.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:26] <jinxer-wm>	 (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:31:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P58866 and previous config saved to /var/cache/conftool/dbconfig/20240321-123135-marostegui.json
[12:34:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2005.codfw.wmnet with reason: host reimage
[12:37:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2005.codfw.wmnet with reason: host reimage
[12:39:05] <logmsgbot>	 !log jiji@deploy1002 Started scap: Check new deployment server (deploy1002) post switchover - March 2024
[12:46:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P58867 and previous config saved to /var/cache/conftool/dbconfig/20240321-124644-marostegui.json
[12:54:39] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1300). nyaa~
[13:00:05] <jouncebot>	 anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:15] <Lucas_WMDE>	 I can’t deploy, sorry
[13:01:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T356166)', diff saved to https://phabricator.wikimedia.org/P58868 and previous config saved to /var/cache/conftool/dbconfig/20240321-130151-marostegui.json
[13:01:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[13:01:55] <claime>	 effie, I don't think you're done with the deploy server switchover, are you?
[13:02:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[13:02:09] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[13:02:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T356166)', diff saved to https://phabricator.wikimedia.org/P58869 and previous config saved to /var/cache/conftool/dbconfig/20240321-130213-marostegui.json
[13:05:12] <effie>	 claime: I am still syncing world
[13:05:18] <claime>	 ack
[13:06:14] <wikibugs>	 (PS7) Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850)
[13:06:25] <wikibugs>	 (CR) CI reject: [V:-1] WIP: Add DSCP marking options to current firewall classes [puppet] - https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[13:07:09] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9648883 (dcaro)
[13:07:33] <wikibugs>	 (CR) Muehlenhoff: Also disable monitoring for AQS1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: Muehlenhoff)
[13:07:41] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: update articledesc and llm images [deployment-charts] - https://gerrit.wikimedia.org/r/1013269 (https://phabricator.wikimedia.org/T360212)
[13:07:49] <wikibugs>	 (PS2) Muehlenhoff: Also disable monitoring for AQS1 [puppet] - https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522)
[13:08:38] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1013245 (owner: Slyngshede)
[13:08:57] <jinxer-wm>	 (ProbeDown) firing: (2) Service aqs:7232 has failed probes (http_aqs_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aqs:7232 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:09:02] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1013241 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[13:09:26] <wikibugs>	 (PS1) Muehlenhoff: Remove unused profile [puppet] - https://gerrit.wikimedia.org/r/1013270
[13:09:30] <jayme>	 moritzm: is that you?
[13:09:42] <wikibugs>	 (CR) EoghanGaffney: gitlab: fix irc log for backup complete message (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1009520 (owner: Jelto)
[13:09:42] <akosiaris>	 What's this?
[13:09:58] <wikibugs>	 (CR) Alexandros Kosiaris: "If I understand correctly we won't be seeing metrics for upstream/downstream that don't see any rps. It's probably ok as a stopgap optimiz" [puppet] - https://gerrit.wikimedia.org/r/1012995 (https://phabricator.wikimedia.org/T359633) (owner: Filippo Giunchedi)
[13:10:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:10:24] <akosiaris>	 !incidents
[13:10:25] <sirenbot>	 4531 (UNACKED)  [2x] ProbeDown sre (ip4 aqs:7232 probes/service http_aqs_ip4)
[13:10:25] <sirenbot>	 4530 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams)
[13:10:25] <sirenbot>	 4529 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams)
[13:10:25] <sirenbot>	 4528 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams)
[13:10:30] <wikibugs>	 (CR) EoghanGaffney: [C:+1] apt-staging: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/1012346 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[13:10:38] <akosiaris>	 !ack 4531
[13:10:38] <sirenbot>	 4531 (ACKED)  [2x] ProbeDown sre (ip4 aqs:7232 probes/service http_aqs_ip4)
[13:10:46] <wikibugs>	 (CR) Alexandros Kosiaris: "I forgot to ask, is this a stopgap? Do we intend to revert it once we 've sorted out other prometheus infrastructure related issues?" [puppet] - https://gerrit.wikimedia.org/r/1012995 (https://phabricator.wikimedia.org/T359633) (owner: Filippo Giunchedi)
[13:11:13] <robh>	 sukhe: heyas i was out of it yesterday booster knocked me on my ass so I didn't write up the detailed directions for remote hands esams
[13:11:14] <jayme>	 akosiaris: I think it's from https://phabricator.wikimedia.org/T360522
[13:11:30] <robh>	 bah wrong channel meant to state in dc ops lol
[13:11:35] <wikibugs>	 (CR) Klausman: [C:+1] ml-services: update articledesc and llm images [deployment-charts] - https://gerrit.wikimedia.org/r/1013269 (https://phabricator.wikimedia.org/T360212) (owner: Ilias Sarantopoulos)
[13:11:49] <jayme>	 and the monitoring disable patch is just a bit late
[13:12:26] <jayme>	 cc brouberol / moritzm
[13:12:47] <wikibugs>	 (PS8) Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850)
[13:13:09] <klausman>	 wow, that Gerrit/wikibugs message sure had a long delay. I +1'd at 13:01 according to the webui, and the bot only mentioned it at 14:11?
[13:13:17] <akosiaris>	 Ok, so it should clear on its own?
[13:13:19] <wikibugs>	 (CR) CI reject: [V:-1] WIP: Add DSCP marking options to current firewall classes [puppet] - https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[13:13:42] <jayme>	 akosiaris: just a wild guess because of the coincidence
[13:13:44] <brouberol>	 sorry about the false-alarm. We had a monitor CR merged to disable the alarm altogether :/
[13:14:20] <brouberol>	 we disabled AQS probe monitoring and then disabled the AQS service, so that's no coincidence. However, we didn't anticipate the alert still firing 
[13:14:26] <logmsgbot>	 !log jiji@deploy1002 Finished scap: Check new deployment server (deploy1002) post switchover - March 2024 (duration: 35m 20s)
[13:14:39] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[13:14:40] <wikibugs>	 (PS5) MVernon: Add new ceph container image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621)
[13:14:48] <wikibugs>	 (CR) MVernon: "Thanks for those two spots; I've corrected both (and updated version to match the newer upstream packages I've pulled to our apt repo), an" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[13:15:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.89% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:15:44] <wikibugs>	 (PS1) Effie Mouzeli: deployment: update deployment DNS record to deploy1002 (switchover #6) [dns] - https://gerrit.wikimedia.org/r/1013272 (https://phabricator.wikimedia.org/T357547)
[13:15:52] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2024-03-21-114859-production [deployment-charts] - https://gerrit.wikimedia.org/r/1013273 (https://phabricator.wikimedia.org/T353510)
[13:16:13] <jayme>	 brouberol: ack - it's probably just missing a puppet run to take effect then
[13:16:22] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[13:16:32] <wikibugs>	 SRE, ChangeProp, Commons, GitLab, and 7 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9648979 (Reedy) https://github.com/Snapchat/KeyDB already existed as a fork. https://github.com/Snapchat/KeyDB/issues/798 was filed ex...
[13:17:00] <wikibugs>	 (CR) Slyngshede: [C:+2] P:mariadb::ferm_misc Add new IDP hosts [puppet] - https://gerrit.wikimedia.org/r/1013245 (owner: Slyngshede)
[13:17:06] <effie>	 Dear Deployers, deployment server is switched to deploy1002, you can proceed 
[13:17:08] <wikibugs>	 (CR) Brouberol: [C:+1] Also disable monitoring for AQS1 [puppet] - https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: Muehlenhoff)
[13:17:16] <wikibugs>	 (CR) Slyngshede: [C:+2] P:acme_chief::certificates Add new IDP hosts. [puppet] - https://gerrit.wikimedia.org/r/1013241 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[13:17:48] <wikibugs>	 (PS1) Effie Mouzeli: hieradata: update deployment_server to deploy1002 (switchover #7) [puppet] - https://gerrit.wikimedia.org/r/1013274 (https://phabricator.wikimedia.org/T357547)
[13:17:56] <wikibugs>	 (Abandoned) Clément Goubert: Revert "Add File:Claus_-_Conkle to blacklist" [deployment-charts] - https://gerrit.wikimedia.org/r/1012771 (owner: Clément Goubert)
[13:18:20] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: update articledesc and llm images [deployment-charts] - https://gerrit.wikimedia.org/r/1013269 (https://phabricator.wikimedia.org/T360212) (owner: Ilias Sarantopoulos)
[13:18:36] <wikibugs>	 (Merged) jenkins-bot: ml-services: update articledesc and llm images [deployment-charts] - https://gerrit.wikimedia.org/r/1013269 (https://phabricator.wikimedia.org/T360212) (owner: Ilias Sarantopoulos)
[13:19:09] <wikibugs>	 (PS1) Fabfur: benthos: allow truncated http protocol version [puppet] - https://gerrit.wikimedia.org/r/1013275 (https://phabricator.wikimedia.org/T358109)
[13:19:25] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] hieradata: update deployment_server to deploy1002 (switchover #7) [puppet] - https://gerrit.wikimedia.org/r/1013274 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[13:19:41] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] deployment: update deployment DNS record to deploy1002 (switchover #6) [dns] - https://gerrit.wikimedia.org/r/1013272 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[13:19:53] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] deployment: update deployment DNS record to deploy1002 (switchover #6) [dns] - https://gerrit.wikimedia.org/r/1013272 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[13:20:00] <effie>	 wikibugs is probably doing something it shouldn't
[13:20:09] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] hieradata: update deployment_server to deploy1002 (switchover #7) [puppet] - 10https://gerrit.wikimedia.org/r/1013274 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli)
[13:20:41] <wikibugs>	 (03PS1) 10Fabfur: benthos: added $schema key to unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1013278 (https://phabricator.wikimedia.org/T360450)
[13:22:06] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9649036 (10Clement_Goubert)
[13:23:26] <wikibugs>	 (03PS1) 10Cparle: Sunsetting MachineVision extension, so remove config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013284 (https://phabricator.wikimedia.org/T352884)
[13:23:38] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9649062 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye
[13:23:46] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] superset-next: upgrade to 3.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013236 (https://phabricator.wikimedia.org/T358674) (owner: 10Brouberol)
[13:25:59] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012664
[13:27:03] <wikibugs>	 (03PS1) 10Ammarpad: Set wgUploadNavigationUrl for is.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013295 (https://phabricator.wikimedia.org/T360431)
[13:27:39] <wikibugs>	 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, 10WMF-JobQueue: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625 (10Clement_Goubert) 03NEW
[13:27:51] <wikibugs>	 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, 10WMF-JobQueue: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649117 (10Clement_Goubert) p:05Triage→03High
[13:30:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Also disable monitoring for AQS1 [puppet] - 10https://gerrit.wikimedia.org/r/1013238 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff)
[13:31:29] <wikibugs>	 (03CR) 10Muehlenhoff: Inform users that their email address needs to be unique. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede)
[13:31:45] <wikibugs>	 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, 10WMF-JobQueue: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649168 (10Clement_Goubert)
[13:33:57] <wikibugs>	 (03PS2) 10Slyngshede: Inform users that their email address needs to be unique. [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757
[13:34:05] <wikibugs>	 (03CR) 10Slyngshede: Inform users that their email address needs to be unique. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede)
[13:34:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] Inform users that their email address needs to be unique. [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede)
[13:34:37] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Inform users that their email address needs to be unique. [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede)
[13:34:53] <wikibugs>	 (03Merged) 10jenkins-bot: Inform users that their email address needs to be unique. [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede)
[13:36:38] <wikibugs>	 (03PS1) 10Clément Goubert: envoy: Add mw-jobrunner and videoscaler listeners [puppet] - 10https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625)
[13:38:34] <wikibugs>	 (03PS1) 10Slyngshede: Bitu version 0.0.6 [software/bitu] - 10https://gerrit.wikimedia.org/r/1013304
[13:38:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1013304 (owner: 10Slyngshede)
[13:39:07] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Bitu version 0.0.6 [software/bitu] - 10https://gerrit.wikimedia.org/r/1013304 (owner: 10Slyngshede)
[13:39:39] <wikibugs>	 (03Merged) 10jenkins-bot: Bitu version 0.0.6 [software/bitu] - 10https://gerrit.wikimedia.org/r/1013304 (owner: 10Slyngshede)
[13:41:43] <wikibugs>	 (03PS9) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850)
[13:42:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[13:42:49] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9649261 (10RobH) > We would like remote hands to fetch shipment DEL0158639 which contains (8) 6.5TB NVMe PCIe SSDs from Dell NL to Wikimedia. >  > Proposed Work Window: 2023-03-27 @ 1100 CET >...
[13:45:48] <wikibugs>	 (03PS2) 10Jelto: gitlab: fix irc log for backup complete message [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520
[13:48:17] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9649336 (10ssingh) @RobH: Verified the hosts, serial numbers, racking and the cadence. Looks good!
[13:48:29] <wikibugs>	 (03PS1) 10Majavah: haproxy: cloud: use package{} to install haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1013308 (https://phabricator.wikimedia.org/T360630)
[13:48:37] <wikibugs>	 (03PS1) 10Majavah: P:metricsinfra: haproxy: do not set httplog on backends [puppet] - 10https://gerrit.wikimedia.org/r/1013309
[13:48:45] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::metricsinfra: haproxy: use http-request replace-path [puppet] - 10https://gerrit.wikimedia.org/r/1013310 (https://phabricator.wikimedia.org/T360630)
[13:49:26] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1676/console" [puppet] - 10https://gerrit.wikimedia.org/r/1013308 (https://phabricator.wikimedia.org/T360630) (owner: 10Majavah)
[13:49:34] <wikibugs>	 (03CR) 10Jelto: gitlab: fix irc log for backup complete message (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 (owner: 10Jelto)
[13:50:03] <sukhe>	 !log upgrading pdns-rec to 4.8.7-1 on dns* and doh* hosts
[13:50:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:00] <wikibugs>	 (03PS1) 10EoghanGaffney: [gitlab] Lock backups on the destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311
[13:52:35] <eoghan>	 Are these notifications delayed? I put up PS1 for this about 20 minutes ago
[13:52:39] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] gitlab: fix irc log for backup complete message [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 (owner: 10Jelto)
[13:52:55] <wikibugs>	 (03CR) 10Majavah: Inform users that their email address needs to be unique. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 (owner: 10Slyngshede)
[13:53:35] <wikibugs>	 (03CR) 10JMeybohm: external-services: define a chart referencing external services clusters (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[13:53:43] <wikibugs>	 (03PS2) 10Tchanders: Schedule weekly purge of global_block_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516)
[13:53:59] <wikibugs>	 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, and 2 others: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649410 (10Joe) There are a few reasons why we didn't migrate changeprop to use the service mesh, first of all the fact we don't want to define timeouts ou...
[13:54:11] <wikibugs>	 (03CR) 10JMeybohm: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol)
[13:54:21] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9649412 (10RobH) CS1553796 created.  Will update once they confirm the window.
[13:54:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [gitlab] Lock backups on the destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 (owner: 10EoghanGaffney)
[13:54:49] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9649413 (10RobH)
[13:55:09] <wikibugs>	 (03PS1) 10David Caro: puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312
[13:55:33] <wikibugs>	 (03PS2) 10David Caro: puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312
[13:55:42] <wikibugs>	 (03CR) 10David Caro: puppetserver.cloud_vps: add role without stale certs check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013312 (owner: 10David Caro)
[13:56:10] <wikibugs>	 (03PS3) 10David Caro: puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312
[13:56:50] <wikibugs>	 (03CR) 10Majavah: [C:03+1] puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312 (owner: 10David Caro)
[13:57:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9649435 (10Jhancock.wm)
[13:57:44] <wikibugs>	 (03CR) 10JMeybohm: external-services: define a chart referencing external services clusters (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[13:57:52] <wikibugs>	 (03CR) 10David Caro: [C:03+2] puppetserver.cloud_vps: add role without stale certs check [puppet] - 10https://gerrit.wikimedia.org/r/1013312 (owner: 10David Caro)
[13:58:24] <wikibugs>	 (03PS2) 10EoghanGaffney: [gitlab] Lock backups on the destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311
[14:00:29] <wikibugs>	 (03CR) 10JMeybohm: "Not a beauty but practical 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[14:01:11] <wikibugs>	 (03PS4) 10Elukey: profile::prometheus::k8s: move istio metrics to a separate job [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390)
[14:01:19] <wikibugs>	 (03CR) 10JMeybohm: "*remove the need for the variable assignments" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[14:01:27] <wikibugs>	 (03CR) 10Elukey: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[14:02:02] <moritzm>	 !log installing squid security updates
[14:02:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:42] <wikibugs>	 (03CR) 10JMeybohm: Add template rendering external services egress NetworkPolicy resources (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:04:30] <wikibugs>	 06SRE: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636 (10MoritzMuehlenhoff) 03NEW
[14:04:38] <wikibugs>	 (03PS1) 10Klausman: ml-services: Free up unused nllb200 pods in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013317
[14:04:46] <wikibugs>	 (03CR) 10Brouberol: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol)
[14:05:22] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[14:05:29] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: Free up unused nllb200 pods in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013317 (owner: 10Klausman)
[14:05:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9649524 (10MoritzMuehlenhoff)
[14:06:01] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "That would do it, at least using the example given on T358940. `1006969` is linked while `#1006969` is not. That is a nice hack. I think G" [puppet] - 10https://gerrit.wikimedia.org/r/1013097 (https://phabricator.wikimedia.org/T358940) (owner: 10Aklapper)
[14:06:33] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab: fix irc log for backup complete message [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 (owner: 10Jelto)
[14:07:07] <wikibugs>	 (03PS1) 10Ammarpad: throttle: Add throttle rule for editathon at Illinois Tech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494)
[14:07:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] throttle: Add throttle rule for editathon at Illinois Tech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) (owner: 10Ammarpad)
[14:07:40] <wikibugs>	 (03CR) 10Klausman: [C:03+2] ml-services: Free up unused nllb200 pods in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013317 (owner: 10Klausman)
[14:08:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff)
[14:09:12] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Free up unused nllb200 pods in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013317 (owner: 10Klausman)
[14:09:20] <wikibugs>	 (03PS2) 10Ammarpad: throttle: Add throttle rule for editathon at Illinois Tech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494)
[14:09:36] <wikibugs>	 (03Merged) 10jenkins-bot: gitlab: fix irc log for backup complete message [cookbooks] - 10https://gerrit.wikimedia.org/r/1009520 (owner: 10Jelto)
[14:10:40] <wikibugs>	 (03CR) 10Jelto: [C:03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1013097 (https://phabricator.wikimedia.org/T358940) (owner: 10Aklapper)
[14:12:23] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/WMF for zoe - https://phabricator.wikimedia.org/T360639 (10zoe) 03NEW
[14:13:03] <wikibugs>	 (03PS7) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411)
[14:13:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[14:14:32] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/WMF for zoe - https://phabricator.wikimedia.org/T360639#9649648 (10zoe)
[14:16:02] <wikibugs>	 (03PS1) 10Klausman: ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321
[14:16:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:16:34] <wikibugs>	 (03PS8) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411)
[14:17:15] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: fix discrepancies caused by shoddy c&p in 1013317 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 (owner: 10Klausman)
[14:18:09] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1678/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[14:18:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudbackup2004 to codfw - jhancock@cumin2002"
[14:19:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudbackup2004 to codfw - jhancock@cumin2002"
[14:19:18] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:20:07] <wikibugs>	 (03CR) 10JMeybohm: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol)
[14:20:15] <wikibugs>	 (03PS25) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[14:20:23] <wikibugs>	 (03CR) 10Brouberol: external-services: define a chart referencing external services clusters (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:21:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudbackup2004.mgmt.codfw.wmnet with reboot policy FORCED
[14:21:12] <wikibugs>	 (03PS4) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508)
[14:21:20] <wikibugs>	 (03CR) 10Brouberol: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol)
[14:21:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:24:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Point urldownloader in eqiad to 1004 [dns] - 10https://gerrit.wikimedia.org/r/1013322
[14:26:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Point urldownloader in eqiad to 1004 [dns] - 10https://gerrit.wikimedia.org/r/1013322 (owner: 10Muehlenhoff)
[14:27:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff)
[14:28:02] <wikibugs>	 (03Abandoned) 10Muehlenhoff: prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013074 (owner: 10Muehlenhoff)
[14:28:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] "Good questions! If the impact is significant and the trade-offs acceptable (e.g. the dashboards like you mentioned) then ideally I'd like " [puppet] - 10https://gerrit.wikimedia.org/r/1012995 (https://phabricator.wikimedia.org/T359633) (owner: 10Filippo Giunchedi)
[14:31:22] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, according to debmonitor there is no host on bookworm using the exporter currently: https://debmonitor.wikimedia.org/packages/prometh" [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn)
[14:32:12] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] etherpad: install mariadb server in wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003769 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto)
[14:33:21] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 (owner: 10EoghanGaffney)
[14:33:35] <wikibugs>	 (03PS2) 10Clément Goubert: envoy: Add missing service mesh listeners [puppet] - 10https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625)
[14:33:38] <wikibugs>	 (03PS1) 10Muehlenhoff: aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522)
[14:34:23] <wikibugs>	 (03PS4) 10Brouberol: Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894)
[14:35:20] <wikibugs>	 (03CR) 10Brouberol: Add template rendering external services egress NetworkPolicy resources (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:35:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T356166)', diff saved to https://phabricator.wikimedia.org/P58871 and previous config saved to /var/cache/conftool/dbconfig/20240321-143528-marostegui.json
[14:35:33] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[14:35:41] <wikibugs>	 (03PS5) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508)
[14:35:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:36:29] <moritzm>	 !log installing glibc security updates on bullseye
[14:36:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:55] <wikibugs>	 (03PS6) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508)
[14:37:17] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff)
[14:39:10] <wikibugs>	 (03PS7) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508)
[14:39:39] <wikibugs>	 (03PS26) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[14:40:00] <wikibugs>	 (03PS5) 10Brouberol: Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894)
[14:41:20] <godog>	 brouberol: re: aqs probes you'll have to change the service status in hieradata/common/service.yaml or set page: false in the service stanza btw
[14:41:53] <brouberol>	 moritzm: could you have a look please? I'm about to go get my kid from daycare. Thank you!
[14:41:59] <godog>	 assuming that's the intended idea, i.e. aqs.discovery.wmnet backends no longer being a thing
[14:42:14] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:42:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "The idp_nodes value is eventually passed down to the memcached config, so that all IDPs update the same memcached backends. Given they are" [puppet] - 10https://gerrit.wikimedia.org/r/1013237 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede)
[14:42:32] <brouberol>	 yes, we were asked to decom AQS, so this is very much no longer a thing
[14:42:54] <wikibugs>	 (03PS1) 10Majavah: ldap: Pass typed data to sssd class [puppet] - 10https://gerrit.wikimedia.org/r/1013324
[14:43:04] <godog>	 brouberol: ack, I'll do the page: false setting thing
[14:43:08] <wikibugs>	 (03PS3) 10Clément Goubert: envoy: Add missing service mesh listeners [puppet] - 10https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625)
[14:43:46] <brouberol>	 I can take care of the CR, but I need to get going right after, meaning I won't be able to deploy it for a bit
[14:43:56] <brouberol>	 I didn't mean to impose :/
[14:43:58] <wikibugs>	 06SRE, 10ChangeProp, 10MW-on-K8s, 06serviceops, and 2 others: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649779 (10Clement_Goubert)
[14:44:40] <godog>	 brouberol: no worries at all! easy enough and I'm in the middle of deploying another prometheus change which means puppet is stopped anyways
[14:44:52] <brouberol>	 appreciated, thank you!
[14:45:04] <moritzm>	 godog, brouberol: let's set page=false as an interim and then we can yank the entire service definition as a followup
[14:45:15] <godog>	 brouberol: sure np
[14:45:18] <wikibugs>	 (03PS2) 10Majavah: ldap: Pass typed data to sssd class [puppet] - 10https://gerrit.wikimedia.org/r/1013324
[14:45:18] <godog>	 moritzm: ack, will do
[14:46:05] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: set aqs to non-paging [puppet] - 10https://gerrit.wikimedia.org/r/1013325 (https://phabricator.wikimedia.org/T360522)
[14:46:08] <godog>	 ^
[14:46:29] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1681/console" [puppet] - 10https://gerrit.wikimedia.org/r/1013324 (owner: 10Majavah)
[14:47:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1013325 (https://phabricator.wikimedia.org/T360522) (owner: 10Filippo Giunchedi)
[14:47:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: set aqs to non-paging [puppet] - 10https://gerrit.wikimedia.org/r/1013325 (https://phabricator.wikimedia.org/T360522) (owner: 10Filippo Giunchedi)
[14:47:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 17 hosts with reason: Schema change T356166
[14:48:03] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[14:48:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 17 hosts with reason: Schema change T356166
[14:48:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641 (10MPhamWMF) 03NEW
[14:48:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ldap: Pass typed data to sssd class [puppet] - 10https://gerrit.wikimedia.org/r/1013324 (owner: 10Majavah)
[14:50:01] <wikibugs>	 (03CR) 10Majavah: "This was still in use in profile::toolforge::prometheus?" [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff)
[14:50:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P58872 and previous config saved to /var/cache/conftool/dbconfig/20240321-145036-marostegui.json
[14:51:55] <wikibugs>	 (03PS27) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[14:51:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] "How so? Per PCC it's unused, see the earlier PCC output." [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff)
[14:52:13] <wikibugs>	 (03CR) 10Brouberol: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:52:58] <wikibugs>	 (03PS9) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411)
[14:53:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Remove unused profile" [puppet] - 10https://gerrit.wikimedia.org/r/1013326
[14:54:33] <wikibugs>	 (03CR) 10Majavah: "I think the PCC sync might be broken due to our recent Puppet 7 migration (and Andrew is working on fixing it), but the profile is very mu" [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff)
[14:55:53] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::prometheus: Fix blackbox exporter installation [puppet] - 10https://gerrit.wikimedia.org/r/1013327
[14:56:11] <wikibugs>	 (03CR) 10Majavah: "Or let's do https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013327 instead?" [puppet] - 10https://gerrit.wikimedia.org/r/1013326 (owner: 10Muehlenhoff)
[14:56:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[14:57:17] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:57:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1013327 (owner: 10Majavah)
[14:57:37] <wikibugs>	 (03CR) 10Muehlenhoff: "Sure, that works. +1d" [puppet] - 10https://gerrit.wikimedia.org/r/1013326 (owner: 10Muehlenhoff)
[14:57:43] <wikibugs>	 (03CR) 10Dzahn: "prometheus::blackbox::check::http which is used all over the place says:" [puppet] - 10https://gerrit.wikimedia.org/r/1013270 (owner: 10Muehlenhoff)
[14:58:38] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Key verified out of band." [puppet] - 10https://gerrit.wikimedia.org/r/1013139 (owner: 10CDobbins)
[14:58:40] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1013139 (owner: 10CDobbins)
[14:59:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:00:04] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Fix blackbox exporter installation [puppet] - 10https://gerrit.wikimedia.org/r/1013327 (owner: 10Majavah)
[15:02:00] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9649869 (10MMiller_WMF) I am Mike's manager and I approve this request!
[15:03:48] <wikibugs>	 (03CR) 10Muehlenhoff: "I'm still reverting the change, though since apparently this is also used by the Ci..." [puppet] - 10https://gerrit.wikimedia.org/r/1013326 (owner: 10Muehlenhoff)
[15:03:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove unused profile" [puppet] - 10https://gerrit.wikimedia.org/r/1013326 (owner: 10Muehlenhoff)
[15:04:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudbackup2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:05:42] <wikibugs>	 (03PS2) 10Muehlenhoff: aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522)
[15:05:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P58873 and previous config saved to /var/cache/conftool/dbconfig/20240321-150544-marostegui.json
[15:06:08] <wikibugs>	 (03PS2) 10Klausman: ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321
[15:06:13] <wikibugs>	 (03PS10) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411)
[15:06:45] <wikibugs>	 (03CR) 10Klausman: ml-services: fix discrepancies caused by shoddy c&p in 1013317 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 (owner: 10Klausman)
[15:06:59] <wikibugs>	 (03CR) 10Klausman: ml-services: fix discrepancies caused by shoddy c&p in 1013317 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 (owner: 10Klausman)
[15:09:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 36.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:11:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:12:27] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudbackup2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:14:18] <wikibugs>	 (03CR) 10BryanDavis: [C:04-2] "Let's sit on this idea for a bit while we wait to see if a strong hard fork of Redis shows up following https://redis.com/blog/redis-adopt" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1012797 (https://phabricator.wikimedia.org/T360378) (owner: 10BryanDavis)
[15:14:42] <wikibugs>	 (03PS1) 10Cparle: MachineVision is being sunsetted, so remove job [puppet] - 10https://gerrit.wikimedia.org/r/1013329 (https://phabricator.wikimedia.org/T352884)
[15:16:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:16:42] <wikibugs>	 (03CR) 10JMeybohm: external-services: define a chart referencing external services clusters (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[15:16:55] <wikibugs>	 (03Restored) 10Muehlenhoff: prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013074 (owner: 10Muehlenhoff)
[15:18:58] <jinxer-wm>	 (ProbeDown) resolved: (2) Service aqs:7232 has failed probes (http_aqs_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aqs:7232 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:32] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9649919 (10Papaul) dbprov2005 re-image is stuck at puppet run. When I log in to the server and try to manually run puppet I get the error below. ` Error: The CRL issued by 'CN=...
[15:20:35] <wikibugs>	 (03CR) 10JMeybohm: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol)
[15:20:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T356166)', diff saved to https://phabricator.wikimedia.org/P58874 and previous config saved to /var/cache/conftool/dbconfig/20240321-152051-marostegui.json
[15:20:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[15:20:57] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[15:20:59] <wikibugs>	 06SRE, 10ChangeProp, 06Commons, 10GitLab, and 8 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9649904 (10brennen) For GitLab: I //think// we currently run the bundled Redis in their Omnibus package. In that case, the easiest thing...
[15:21:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[15:21:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[15:21:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[15:21:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T356166)', diff saved to https://phabricator.wikimedia.org/P58875 and previous config saved to /var/cache/conftool/dbconfig/20240321-152134-marostegui.json
[15:21:41] <wikibugs>	 (03CR) 10Jcrespo: "Thank you, you now understood what I meant. Looking good, no blockers on my side to deploy." [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb)
[15:22:22] <wikibugs>	 (03PS28) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[15:22:31] <wikibugs>	 (03CR) 10Brouberol: external-services: define a chart referencing external services clusters (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[15:22:53] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudbackup2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:23:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:24:44] <wikibugs>	 06SRE, 06serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#9649944 (10thcipriani)
[15:25:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9649946 (10Jhancock.wm) Found the drive as absent in iDRAC. Physically, the drive is there but is not blinking like the other drives....
[15:25:23] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[15:26:09] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Add new ceph container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[15:26:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2004']
[15:27:21] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudbackup2004']
[15:27:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2004']
[15:28:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:30:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:31:08] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Lock backups on the destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 (owner: 10EoghanGaffney)
[15:32:30] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dbprov2005.codfw.wmnet with OS bullseye
[15:32:35] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye executed with errors: - dbprov20...
[15:33:21] <wikibugs>	 (03PS8) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508)
[15:33:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye
[15:33:52] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye
[15:33:57] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudbackup2004']
[15:34:01] <wikibugs>	 (03CR) 10Brouberol: admin-ng: Define external services namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol)
[15:34:58] <wikibugs>	 (03Merged) 10jenkins-bot: [gitlab] Lock backups on the destination host before starting [cookbooks] - 10https://gerrit.wikimedia.org/r/1013311 (owner: 10EoghanGaffney)
[15:35:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:37:40] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650056 (10jcrespo) That is new and doesn't happen on the old hosts, but not a big blocker. However, the OS was installed on the SSDs, not on the HDs- that is much more unfixabl...
[15:38:33] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff)
[15:41:29] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "prometheus: scrape envoy on k8s metrics with 'usedonly'" [puppet] - 10https://gerrit.wikimedia.org/r/1013254
[15:41:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9650076 (10Jhancock.wm)
[15:41:45] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Revert "prometheus: scrape envoy on k8s metrics with 'usedonly'" [puppet] - 10https://gerrit.wikimedia.org/r/1013254 (owner: 10Filippo Giunchedi)
[15:41:54] <wikibugs>	 (03CR) 10Aklapper: [C:03+2] "+2 self-approving" [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/984213 (https://phabricator.wikimedia.org/T338611) (owner: 10Aklapper)
[15:42:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] Revert "prometheus: scrape envoy on k8s metrics with 'usedonly'" [puppet] - 10https://gerrit.wikimedia.org/r/1013254 (owner: 10Filippo Giunchedi)
[15:42:36] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] AVA: Remove unused variable; take age into account [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/984213 (https://phabricator.wikimedia.org/T338611) (owner: 10Aklapper)
[15:50:41] <claime>	 !log cgoubert@deploy1002:~$ sudo chown imagecatalog:imagecatalog /srv/deployment/imagecatalog/catalog.sqlite
[15:50:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:19] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[15:53:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Configure dbprov2005/2006 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013332 (https://phabricator.wikimedia.org/T355355)
[15:53:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: imagecatalog_record.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:48] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9650092 (10ovasileva) >>! In T355914#9578211, @Jdlrobson wrote: > Providing engineering  perspective on behalf of the WMF web team, I agree that if we wan...
[15:54:49] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[15:54:59] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] "Thanks for the review volans!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[15:55:01] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[15:56:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Configure dbprov2005/2006 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013332 (https://phabricator.wikimedia.org/T355355) (owner: 10Muehlenhoff)
[16:00:04] <jouncebot>	 jhathaway and rzl: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:01:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] requesttracker: switch SSL cert provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013145 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[16:01:28] <wikibugs>	 (03PS2) 10Dzahn: requesttracker: switch SSL cert provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013145 (https://phabricator.wikimedia.org/T360413)
[16:04:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Delete peopleweb dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1013227 (https://phabricator.wikimedia.org/T360413) (owner: 10Muehlenhoff)
[16:04:17] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] Delete peopleweb dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1013227 (https://phabricator.wikimedia.org/T360413) (owner: 10Muehlenhoff)
[16:04:59] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox
[16:05:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:06:07] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9650260 (10cmooney) >>! In T326322#9130092, @ayounsi wrote: > @cmooney I came across https://www.juniper.net/documentation/us/en/softwar...
[16:06:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T356166)', diff saved to https://phabricator.wikimedia.org/P58878 and previous config saved to /var/cache/conftool/dbconfig/20240321-160653-marostegui.json
[16:06:57] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[16:07:03] <urandom>	 !log disabling read-repair (Cassandra) for restbase tables — T360548
[16:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:07] <stashbot>	 T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[16:08:38] <wikibugs>	 (03PS1) 10Elukey: Add the amd-pytorch base image for ML workloads [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638)
[16:10:15] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1013145 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[16:12:55] <wikibugs>	 (03PS56) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[16:13:05] <wikibugs>	 (03CR) 10Elukey: "Hi folks! This is the first version of the Pytorch's base image. The total size is 12.4GB (!! sigh), but I am able to run a python3 interp" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey)
[16:13:40] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: 14decommission db2096 - 14https://phabricator.wikimedia.org/T360554#9650362 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:14:03] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:14:09] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:14:20] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye
[16:15:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:17:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye
[16:17:39] <logmsgbot>	 !log sukhe@cumin1002 END (ERROR) - Cookbook sre.dns.roll-reboot (exit_code=97) rolling reboot on A:dnsbox
[16:17:50] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye
[16:18:30] <jinxer-wm>	 (ProbeDown) firing: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:25] <wikibugs>	 (03PS1) 10Ssingh: sre.dns.roll-reboot: fix typo in depool_sleep [cookbooks] - 10https://gerrit.wikimedia.org/r/1013336
[16:20:50] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/WMF for zoe - https://phabricator.wikimedia.org/T360639#9650405 (10VPuffetMichel) Hi there, Zoe is the new member of the editing team. Let me know if you need anything from me.
[16:21:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.68% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:22:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P58879 and previous config saved to /var/cache/conftool/dbconfig/20240321-162200-marostegui.json
[16:22:41] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_nginx.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:22:42] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] gitlab: temporary allow dockerfile frontend on Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/1013049 (https://phabricator.wikimedia.org/T357612) (owner: 10Jelto)
[16:24:29] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] sre.dns.roll-reboot: fix typo in depool_sleep [cookbooks] - 10https://gerrit.wikimedia.org/r/1013336 (owner: 10Ssingh)
[16:25:24] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=registry1003.eqiad.wmnet
[16:25:58] <elukey>	 !log expand vram for registry100[3,4] from 4G to 6G - T360637
[16:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:01] <stashbot>	 T360637: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637
[16:26:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:27:53] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM registry1003.eqiad.wmnet
[16:30:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:33:35] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye
[16:34:24] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dns-rec and not P{dns1004*} and A:dnsbox
[16:34:27] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) (owner: 10Brouberol)
[16:35:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:35:44] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[16:35:52] <elukey>	 !log edit /etc/network/interfaces on registry1003 (ens5 => ens13) - T360637
[16:35:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:56] <stashbot>	 T360637: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637
[16:36:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye
[16:36:17] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650500 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye
[16:36:20] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov2005.codfw.wmnet with OS bullseye
[16:36:34] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9650501 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye executed w...
[16:37:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P58880 and previous config saved to /var/cache/conftool/dbconfig/20240321-163708-marostegui.json
[16:37:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:37:17] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job docker-registry in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:38:29] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1003.eqiad.wmnet
[16:38:45] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=registry1003.eqiad.wmnet
[16:38:58] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=registry1004.eqiad.wmnet
[16:39:02] <wikibugs>	 (03CR) 10JMeybohm: "Quick question right away: Does it make sense to start with a "version including" naming scheme right away? Will there be pytorch2.2 and p" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey)
[16:39:04] <wikibugs>	 (03CR) 10MVernon: [V:03+2 C:03+2] Add new ceph container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[16:39:39] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM registry1004.eqiad.wmnet
[16:40:00] <wikibugs>	 (03PS1) 10EoghanGaffney: [gitlab] Switch gitlab-replica from gitlab1004 to gitlab1003 [puppet] - 10https://gerrit.wikimedia.org/r/1013339 (https://phabricator.wikimedia.org/T358559)
[16:40:25] <jinxer-wm>	 (SystemdUnitFailed) firing: build-homepage.service on registry1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:40:26] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: fix discrepancies caused by shoddy c&p in 1013317 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013321 (owner: 10Klausman)
[16:41:57] <jinxer-wm>	 (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:42:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:42:17] <wikibugs>	 (03CR) 10Elukey: "This is a good point, I had in my mind the idea that only one pytorch version will be canonical in the future, but for sure we'll end up h" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey)
[16:42:17] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job docker-registry in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:42:23] <herron>	 !incidents
[16:42:23] <sirenbot>	 4532 (UNACKED)  ProbeDown sre (10.2.2.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 eqiad)
[16:42:23] <sirenbot>	 4531 (RESOLVED)  [2x] ProbeDown sre (ip4 aqs:7232 probes/service http_aqs_ip4)
[16:42:23] <sirenbot>	 4530 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams)
[16:42:24] <sirenbot>	 4529 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams)
[16:42:24] <sirenbot>	 4528 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams)
[16:42:43] <herron>	 !ack 4532
[16:42:43] <jinxer-wm>	 (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:42:43] <sirenbot>	 4532 (ACKED)  ProbeDown sre (10.2.2.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 eqiad)
[16:42:56] <jhathaway>	 o/, herron known?
[16:42:58] <elukey>	 herron: oooff sorry
[16:43:01] <herron>	 is that expected/related?  not known to me
[16:43:21] <herron>	 elukey: ha no worries, you are working on it?
[16:43:22] <elukey>	 I am bumping the RAM on the eqiad registry VMs, but one is up (I am working on the other one)
[16:43:28] <elukey>	 not sure why it paged
[16:43:41] <herron>	 ack ok, thank you
[16:44:28] <elukey>	 !log edit /etc/network/interfaces on registry1004 (ens5 => ens13) - T360637
[16:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:32] <stashbot>	 T360637: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637
[16:46:24] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1004.eqiad.wmnet
[16:46:39] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=registry1004.eqiad.wmnet
[16:46:57] <jinxer-wm>	 (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:47:09] <Reedy>	 jouncebot: nowandnext
[16:47:09] <jouncebot>	 For the next 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1600)
[16:47:09] <jouncebot>	 In 0 hour(s) and 12 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1700)
[16:47:09] <jouncebot>	 In 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1700)
[16:47:12] <elukey>	 this makes zero sense to me herron
[16:47:25] <herron>	 did the interface name change on the reboot?
[16:47:26] <jinxer-wm>	 (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:47:36] <elukey>	 I was working on registry1004 (depooled) and registry1003 was pooled and up
[16:47:43] <elukey>	 yes yes it changed, I had to fix it etc..
[16:47:47] <elukey>	 but the other host was up
[16:48:44] <jhathaway>	 registry1003 has an uptime of 19min, is that expected?
[16:48:59] <mutante>	 if it's like other blackbox checks applied to the role it checks the same virtual host on all backends
[16:49:02] <elukey>	 yes yes I worked on that as well, before 1004
[16:49:07] <jhathaway>	 ah
[16:50:44] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "Not necessary to change these files - they are just static snapshots." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012989 (https://phabricator.wikimedia.org/T359983) (owner: 10Mabualruz)
[16:51:28] <elukey>	 mutante: o/ but sre.ganeti.reboot-vm does the downtime etc., so in theory the host I was working on shouldn't have alarmed
[16:52:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T356166)', diff saved to https://phabricator.wikimedia.org/P58881 and previous config saved to /var/cache/conftool/dbconfig/20240321-165215-marostegui.json
[16:52:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[16:52:23] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[16:52:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[16:52:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T356166)', diff saved to https://phabricator.wikimedia.org/P58882 and previous config saved to /var/cache/conftool/dbconfig/20240321-165240-marostegui.json
[16:52:51] <mutante>	 elukey: hmm.. it's not unheard of that we had "failed to set downtime" in cookbook 
[16:52:54] <cdanis>	 are the blackbox probes per-host, or do they go against the service address?
[16:52:54] <wikibugs>	 (03PS2) 10Elukey: Add the amd-pytorch base image for ML workloads [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638)
[16:53:30] <elukey>	 cdanis: no idea, I thought the service but I didn't check before the maintenance, my bad
[16:53:40] <cdanis>	 I was pretty sure service address as well
[16:53:44] <cdanis>	 but I don't actually know
[16:54:22] <elukey>	 in theory what alarmed was the http_docker-registry_ip4
[16:55:17] <elukey>	 I am wondering if for some reason registry1003 was not completely up when I worked on 1004, service wise
[16:56:36] <elukey>	 nope access logs are good for 1003
[16:56:38] <wikibugs>	 (03CR) 10JMeybohm: global_config: rework external services data structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[16:59:40] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[16:59:54] <mutante>	 this should answer the question if it's on each backend or not, looks like not:
[17:00:00] <mutante>	 https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*registry.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[17:00:05] <jouncebot>	 bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1700)
[17:00:12] <mutante>	 ^ all probe results matching *registry*
[17:00:39] <elukey>	 thanks it matches with what I found as well
[17:00:46] <wikibugs>	 (03PS11) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411)
[17:00:55] <elukey>	 at this point the only thing that I can think of is that I was too quick in moving to the other node
[17:01:22] <wikibugs>	 (03PS1) 10Fabfur: benthos/haproxy: delete some fields that aren't in curr webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1013341 (https://phabricator.wikimedia.org/T360642)
[17:01:28] <wikibugs>	 (03PS12) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411)
[17:01:57] <elukey>	 sorry for the noise folks!
[17:02:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[17:03:09] <mutante>	 probably just unlucky timing, ack
[17:03:27] <wikibugs>	 (03PS1) 10Jdlrobson: Support legacy message box styles markup in JavaScript [skins/Vector] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633)
[17:04:06] * elukey afk o/
[17:04:24] <wikibugs>	 (03PS13) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411)
[17:06:01] <urandom>	 !log restarting decommissions (restbase1024-{b,c}) — T360548
[17:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:18] <stashbot>	 T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[17:07:31] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1683/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[17:08:30] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] global_config: rework external services data structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[17:09:33] <wikibugs>	 (03PS1) 10Dzahn: delete rt.discovery.wmnet certificate, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013345 (https://phabricator.wikimedia.org/T360413)
[17:11:01] <wikibugs>	 (03PS1) 10Reedy: GenerateFancyCaptchas: Include stderr result if captcha.py returns an error code [extensions/ConfirmEdit] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013257 (https://phabricator.wikimedia.org/T360653)
[17:11:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:13:59] <wikibugs>	 (03PS1) 10Dzahn: delete rt.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1013367 (https://phabricator.wikimedia.org/T360413)
[17:14:12] <wikibugs>	 (03PS2) 10Dzahn: delete rt.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1013367 (https://phabricator.wikimedia.org/T360413)
[17:14:36] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] delete rt.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1013367 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[17:15:16] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] delete rt.discovery.wmnet certificate, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013345 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[17:15:40] <wikibugs>	 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9650773 (10Dzahn)
[17:16:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 36.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:17:46] <wikibugs>	 (03PS2) 10Dzahn: planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413)
[17:18:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[17:19:40] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:20:21] <wikibugs>	 (03PS1) 10Cparle: MachineVision extension is being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967)
[17:22:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Support legacy message box styles markup in JavaScript [skins/Vector] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633) (owner: 10Jdlrobson)
[17:23:04] <wikibugs>	 (03CR) 10Reedy: [C:03+2] GenerateFancyCaptchas: Include stderr result if captcha.py returns an error code [extensions/ConfirmEdit] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013257 (https://phabricator.wikimedia.org/T360653) (owner: 10Reedy)
[17:24:25] <wikibugs>	 (03PS14) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765
[17:24:33] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1012765 (owner: 10Andrew Bogott)
[17:27:38] <wikibugs>	 (03PS2) 10Cparle: MachineVision extension is being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967)
[17:28:13] <wikibugs>	 (03PS3) 10Cparle: MachineVision extension is being sunsetted, so stop doing dumps [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967)
[17:30:55] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9650886 (10cmooney)
[17:35:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: build-homepage.service on registry1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:39:13] <wikibugs>	 (03Merged) 10jenkins-bot: GenerateFancyCaptchas: Include stderr result if captcha.py returns an error code [extensions/ConfirmEdit] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013257 (https://phabricator.wikimedia.org/T360653) (owner: 10Reedy)
[18:00:05] <jouncebot>	 dancy and hashar: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T1800)
[18:00:17] <logmsgbot>	 !log reedy@deploy1002 Synchronized php-1.42.0-wmf.23/extensions/ConfirmEdit/maintenance/GenerateFancyCaptchas.php: T360653 (duration: 16m 00s)
[18:00:25] <dancy>	 oooh perfect timing
[18:00:29] <stashbot>	 T360653: GenerateFancyCaptchas doesn't output errors relating to running captcha.py - https://phabricator.wikimedia.org/T360653
[18:00:29] <dancy>	 nice work Reedy
[18:00:33] <Reedy>	 :D
[18:00:58] <dancy>	 All clear?
[18:01:11] <Reedy>	 yup :)
[18:01:32] <dancy>	 Alright.  Pressing the button
[18:01:44] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013378 (https://phabricator.wikimedia.org/T354441)
[18:01:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013378 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot)
[18:02:30] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013378 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot)
[18:06:08] <wikibugs>	 (03PS1) 10Andrew Bogott: base: remove profile::base::manage_timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/1013382
[18:12:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 31.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:13:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 938.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:13:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye
[18:14:06] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9651230 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye
[18:16:24] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.23  refs T354441
[18:16:28] <stashbot>	 T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441
[18:17:41] <wikibugs>	 (03CR) 10Dreamy Jazz: "I'd prefer that tests exist for the script before we run it automatically on all wikis, but not a deal breaker to me." [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: 10Tchanders)
[18:18:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 918.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:21:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T356166)', diff saved to https://phabricator.wikimedia.org/P58884 and previous config saved to /var/cache/conftool/dbconfig/20240321-182117-marostegui.json
[18:21:22] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[18:22:39] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "ERROR: Failed to parse hieradata/role/common/planet.yaml: (hieradata/role/common/planet.yaml): did not find expected alphabetic or numeric" [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[18:25:43] <wikibugs>	 (03PS1) 10Ahmon Dancy: logstash_checker.py: Fix error reporting bug [puppet] - 10https://gerrit.wikimedia.org/r/1013385
[18:26:16] <wikibugs>	 (03PS3) 10Dzahn: planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413)
[18:27:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[18:27:49] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] logstash_checker.py: Fix error reporting bug [puppet] - 10https://gerrit.wikimedia.org/r/1013385 (owner: 10Ahmon Dancy)
[18:30:21] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye
[18:32:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 36.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:33:23] <wikibugs>	 (03CR) 10Krinkle: mediawiki.yaml: Use static.php to serve www.mediawiki.org/ontology/ontology.owl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[18:34:21] <wikibugs>	 (03PS4) 10Dzahn: planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413)
[18:34:54] <wikibugs>	 (03PS1) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686)
[18:35:34] <wikibugs>	 (03PS1) 10Dzahn: delete planet.discovery.wmnet certificate, switched to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013387 (https://phabricator.wikimedia.org/T360413)
[18:36:03] <wikibugs>	 (03PS1) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686)
[18:36:13] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9651386 (10Papaul) @MoritzMuehlenhoff i tried again the re-image once the server reboots after the OS install the cookbook failed with error below. ` Excep...
[18:36:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P58886 and previous config saved to /var/cache/conftool/dbconfig/20240321-183625-marostegui.json
[18:36:37] <wikibugs>	 (03PS1) 10Dzahn: delete planet.discovery.wmnet key [labs/private] - 10https://gerrit.wikimedia.org/r/1013388 (https://phabricator.wikimedia.org/T360413)
[18:39:01] <wikibugs>	 (03PS2) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686)
[18:40:11] <wikibugs>	 (03PS3) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686)
[18:41:15] <wikibugs>	 (03CR) 10Dreamy Jazz: "We may want to wait until we have a date for deployment and wait to merge this until deployment is not far away to avoid the API and Speci" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz)
[18:42:14] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:46:24] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013382 (owner: 10Andrew Bogott)
[18:47:16] <wikibugs>	 (03PS15) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765
[18:47:34] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1012765 (owner: 10Andrew Bogott)
[18:51:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P58887 and previous config saved to /var/cache/conftool/dbconfig/20240321-185132-marostegui.json
[18:51:59] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:54:00] <wikibugs>	 (03PS3) 10Ahmon Dancy: mediawiki.yaml: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl [puppet] - 10https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807)
[18:54:00] <wikibugs>	 (03PS1) 10Ahmon Dancy: Route /w/docs/ to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/1013389 (https://phabricator.wikimedia.org/T171807)
[18:54:40] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:54:47] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:54:59] <topranks>	 !log removing IPv6 VRRP config on codfw core routers for vlan 2018 private1-b-codfw T351534
[18:55:01] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[18:55:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:03] <stashbot>	 T351534: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534
[18:56:40] <wikibugs>	 (03CR) 10Ahmon Dancy: mediawiki.yaml: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[18:58:08] <wikibugs>	 (03CR) 10Ahmon Dancy: "Analogous to the recent changes made in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012439" [puppet] - 10https://gerrit.wikimedia.org/r/1013389 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[18:59:51] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] mediawiki.yaml: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl [puppet] - 10https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[18:59:54] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] Route /w/docs/ to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/1013389 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy)
[19:00:13] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "SAN field looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[19:03:55] <wikibugs>	 (03PS2) 10Jdlrobson: Support legacy message box styles markup in JavaScript [skins/Vector] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633)
[19:05:07] <wikibugs>	 (03CR) 10Krinkle: Use more compact PHP7 syntax where possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE))
[19:06:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T356166)', diff saved to https://phabricator.wikimedia.org/P58888 and previous config saved to /var/cache/conftool/dbconfig/20240321-190640-marostegui.json
[19:06:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[19:06:53] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[19:06:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[19:07:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[19:07:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[19:07:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T356166)', diff saved to https://phabricator.wikimedia.org/P58889 and previous config saved to /var/cache/conftool/dbconfig/20240321-190723-marostegui.json
[19:08:17] <wikibugs>	 (03PS2) 10Dzahn: delete planet.discovery.wmnet key [labs/private] - 10https://gerrit.wikimedia.org/r/1013388 (https://phabricator.wikimedia.org/T360413)
[19:09:01] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] delete planet.discovery.wmnet certificate, switched to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013387 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[19:09:36] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] delete planet.discovery.wmnet key [labs/private] - 10https://gerrit.wikimedia.org/r/1013388 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[19:09:56] <topranks>	 !log adding routes to codfw row b hosts towards spine switch IPs on private1-b-codfw T351534
[19:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:00] <stashbot>	 T351534: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534
[19:10:28] <wikibugs>	 (03CR) 10Dzahn: ssl: delete peopleweb cert, replaced by cfssl provided cert [puppet] - 10https://gerrit.wikimedia.org/r/1013128 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[19:11:07] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] ssl: delete peopleweb cert, replaced by cfssl provided cert [puppet] - 10https://gerrit.wikimedia.org/r/1013128 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[19:15:14] <wikibugs>	 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9651556 (10Dzahn)
[19:16:59] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:17:02] <wikibugs>	 (03PS1) 10Dzahn: delete etherpad.discovery ssl key, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013393 (https://phabricator.wikimedia.org/T360413)
[19:17:44] <topranks>	 !log remove VRRP GW IP for vlan 2018 from codfw core routers and add to EVPN switches irb.2018 interface T351534
[19:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:48] <stashbot>	 T351534: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534
[19:20:19] <wikibugs>	 (03PS1) 10Dzahn: delete etherpad.discovery.wmnet dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1013394 (https://phabricator.wikimedia.org/T360413)
[19:20:20] <wikibugs>	 (03PS2) 10Dzahn: delete etherpad.discovery ssl cert, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013393 (https://phabricator.wikimedia.org/T360413)
[19:22:35] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[28,32,34-36].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002
[19:22:40] <stashbot>	 T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[19:36:31] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Create a mailing list for plwiki arbcom - https://phabricator.wikimedia.org/T360682#9651619 (Ladsgroup) Open→Resolved a:Ladsgroup {{done}} https://lists.wikimedia.org/postorius/lists/wikipedia-pl-arbcom.lists.wikimedia.org/  Please let me know if you h...
[19:37:56] <wikibugs>	 (PS1) Bking: elastic: Bring elastic2107/2108 into service [puppet] - https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878)
[19:39:26] <wikibugs>	 (CR) CI reject: [V:-1] elastic: Bring elastic2107/2108 into service [puppet] - https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[19:39:34] <wikibugs>	 (PS2) Ryan Kemper: elastic: Bring elastic2107/2108 into service [puppet] - https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[19:40:45] <wikibugs>	 (CR) CI reject: [V:-1] elastic: Bring elastic2107/2108 into service [puppet] - https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[19:41:38] <wikibugs>	 (PS3) Bking: elastic: Bring elastic2107/2108 into service [puppet] - https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878)
[19:41:49] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[19:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[19:52:02] <wikibugs>	 (CR) Ryan Kemper: [C:+1] elastic: Bring elastic2107/2108 into service [puppet] - https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[19:52:10] <wikibugs>	 (CR) Bking: [C:+2] elastic: Bring elastic2107/2108 into service [puppet] - https://gerrit.wikimedia.org/r/1013395 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[19:59:29] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240321T2000)
[20:00:05] <jouncebot>	 jan_drewniak and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:01:38] <jan_drewniak>	 o/
[20:02:16] <cjming>	 hi jan_drewniak - do you want to self-deploy?
[20:02:38] <cjming>	 i'm happy to do both of ours if you prefer
[20:04:10] <jan_drewniak>	 Hi cjming ! if you could do both that'd be great (I think you can do two at once with scap backport 1013255 1009718)
[20:04:35] <cjming>	 alrighty - i'll start in
[20:06:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633) (owner: Jdlrobson)
[20:10:48] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-b-codfw entries - cmooney@cumin1002"
[20:11:04] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[28,32,34-36].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002
[20:11:08] <stashbot>	 T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[20:11:41] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-b-codfw entries - cmooney@cumin1002"
[20:11:41] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:11:44] <wikibugs>	 (PS1) Bking: elastic-codfw: Add new master-eligibles [puppet] - https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878)
[20:12:22] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[20:12:59] <wikibugs>	 (CR) CI reject: [V:-1] elastic-codfw: Add new master-eligibles [puppet] - https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[20:13:49] <wikibugs>	 (PS2) Bking: elastic-codfw: Add new master-eligibles [puppet] - https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878)
[20:14:15] <topranks>	 !log deleting irb.2018 interfaces from codfw spine switches T351534
[20:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:20] <stashbot>	 T351534: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534
[20:15:44] <wikibugs>	 (CR) Ryan Kemper: [C:+1] elastic-codfw: Add new master-eligibles [puppet] - https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[20:16:01] <wikibugs>	 (CR) Bking: [C:+2] elastic-codfw: Add new master-eligibles [puppet] - https://gerrit.wikimedia.org/r/1013398 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[20:16:34] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:16:41] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:18:45] <jinxer-wm>	 (ProbeDown) firing: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[20:22:41] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_nginx.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:23:27] <wikibugs>	 (CR) Gergő Tisza: [C:+1] Use more compact PHP7 syntax where possible (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737859 (owner: Thiemo Kreuz (WMDE))
[20:23:30] <jinxer-wm>	 (ProbeDown) resolved: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:25:56] <wikibugs>	 (CR) Gergő Tisza: [C:+1] "Scheduled for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T1300" [mediawiki-config] - https://gerrit.wikimedia.org/r/737859 (owner: Thiemo Kreuz (WMDE))
[20:27:19] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[20:27:23] <wikibugs>	 (Merged) jenkins-bot: Support legacy message box styles markup in JavaScript [skins/Vector] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1013255 (https://phabricator.wikimedia.org/T360633) (owner: Jdlrobson)
[20:27:53] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:1013255|Support legacy message box styles markup in JavaScript (T360633)]]
[20:27:57] <stashbot>	 T360633: Non-codex legacy MW message box related styles are not being applied on Vector 2022 - https://phabricator.wikimedia.org/T360633
[20:27:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[20:28:18] <wikibugs>	 SRE, ChangeProp, Commons, GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9651946 (Krinkle)
[20:29:42] <wikibugs>	 (PS1) Bking: elastic: move elastic2037 to insetup [puppet] - https://gerrit.wikimedia.org/r/1013401 (https://phabricator.wikimedia.org/T358882)
[20:34:53] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-b-codfw entries - cmooney@cumin1002"
[20:35:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
[20:35:14] <stashbot>	 T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878
[20:35:41] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-b-codfw entries - cmooney@cumin1002"
[20:35:42] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:35:56] <wikibugs>	 SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 8 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9651962 (Peachey88)
[20:37:03] <wikibugs>	 (CR) Ryan Kemper: [C:+1] elastic: move elastic2037 to insetup [puppet] - https://gerrit.wikimedia.org/r/1013401 (https://phabricator.wikimedia.org/T358882) (owner: Bking)
[20:37:05] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
[20:37:10] <wikibugs>	 (CR) Bking: [C:+2] elastic: move elastic2037 to insetup [puppet] - https://gerrit.wikimedia.org/r/1013401 (https://phabricator.wikimedia.org/T358882) (owner: Bking)
[20:42:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T356166)', diff saved to https://phabricator.wikimedia.org/P58891 and previous config saved to /var/cache/conftool/dbconfig/20240321-204249-marostegui.json
[20:42:54] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[20:43:57] <topranks>	 !log deleting irb.2001 and irb.2002 interfaces from codfw spine switches 
[20:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:32] <logmsgbot>	 !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:1013255|Support legacy message box styles markup in JavaScript (T360633)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:45:40] <stashbot>	 T360633: Non-codex legacy MW message box related styles are not being applied on Vector 2022 - https://phabricator.wikimedia.org/T360633
[20:46:05] <cjming>	 jan_drewniak: not sure why it took so long but your patch can be tested now
[20:47:39] <jan_drewniak>	 cjming: ok looks great, good to sync
[20:47:46] <logmsgbot>	 !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync
[20:50:33] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[29,32,37-39,25-27,30,33,40-42].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002
[20:50:37] <stashbot>	 T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[20:57:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P58892 and previous config saved to /var/cache/conftool/dbconfig/20240321-205756-marostegui.json
[20:58:32] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:58:40] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:59:30] <Reedy>	 dancy: Hey, it seems the train might have broken captchas
[20:59:49] <dancy>	 Awesome.  Rollback needed?
[21:00:01] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic20[89-99]\.codfw\.wmnet
[21:00:03] <Reedy>	 Based on https://grafana.wikimedia.org/d/000000370/captcha-failure-rates?orgId=1 yeah
[21:00:09] <cjming>	 i'm still finishing up the window - can i finish one more config change?
[21:00:27] <Reedy>	 >18:16 dancy@deploy1002: rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.23 refs T354441
[21:00:27] <stashbot>	 T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441
[21:00:43] <Reedy>	 That 1816 seems to correlate with the bottom graph going from ~50 to 100%
[21:00:47] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic210[0-9]\.codfw\.wmnet
[21:01:15] <Reedy>	 And looks like it increases a bit in the previous ~24-48 hours (presumably as other parts of the train rolled)
[21:01:15] <dancy>	 cjming: I can roll back when your stuff is done.
[21:01:29] <topranks>	 !log adding routes to codfw row a hosts towards spine switch IPs on private1-a-codfw T351532
[21:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:41] <stashbot>	 T351532: Migrate IP gateway for public1-a-codfw to spine switches - https://phabricator.wikimedia.org/T351532
[21:02:11] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic20[89]\.codfw\.wmnet
[21:02:34] <cjming>	 dancy: thanks! just hopefully a quick config change -- the one backport seemed to take forever
[21:02:52] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic209[0-9]\.codfw\.wmnet
[21:03:01] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1013255|Support legacy message box styles markup in JavaScript (T360633)]] (duration: 35m 07s)
[21:03:05] <stashbot>	 T360633: Non-codex legacy MW message box related styles are not being applied on Vector 2022 - https://phabricator.wikimedia.org/T360633
[21:03:05] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic2089\.codfw\.wmnet
[21:03:35] <wikibugs>	 (PS4) Clare Ming: ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config [mediawiki-config] - https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: Phuedx)
[21:03:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
[21:04:08] <stashbot>	 T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878
[21:04:40] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:04:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: Phuedx)
[21:04:57] <wikibugs>	 (CR) TrainBranchBot: "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: Phuedx)
[21:06:00] <topranks>	 !log deleting VRRP GW for 10.192.0.1 / private1-a-codfw from codfw core routers and adding to leaf switches row A T351532
[21:06:01] <wikibugs>	 (Merged) jenkins-bot: ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config [mediawiki-config] - https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: Phuedx)
[21:06:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:18] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:1009718|ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config (T352342)]]
[21:06:24] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9652068 (bking) We're going to upgrade curator (as well as its library)...
[21:06:30] <stashbot>	 T352342:  QA WebUIScroll port to the new metrics platform - https://phabricator.wikimedia.org/T352342
[21:08:45] <logmsgbot>	 !log cjming@deploy1002 cjming and phuedx: Backport for [[gerrit:1009718|ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config (T352342)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:08:50] <logmsgbot>	 !log cjming@deploy1002 cjming and phuedx: Continuing with sync
[21:13:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P58893 and previous config saved to /var/cache/conftool/dbconfig/20240321-211303-marostegui.json
[21:14:18] <wikibugs>	 SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 8 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9652082 (Krinkle) In MediaWiki (as deployed at WMF), there exists 1 use of Redis, which is during file uploads via...
[21:20:42] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1009718|ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config (T352342)]] (duration: 14m 24s)
[21:20:47] <stashbot>	 T352342:  QA WebUIScroll port to the new metrics platform - https://phabricator.wikimedia.org/T352342
[21:20:49] <cjming>	 !log end of UTC late backport window
[21:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:59] <cjming>	 dancy: all yours - thanks for your patience
[21:21:04] <dancy>	 thx
[21:21:30] <dancy>	 Reedy: is there a ticket for that issue?
[21:22:00] <Reedy>	 T360717
[21:22:00] <stashbot>	 T360717: CAPTCHA failure rate at 100% - https://phabricator.wikimedia.org/T360717
[21:22:04] <dancy>	 thx
[21:22:15] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013404 (https://phabricator.wikimedia.org/T354441)
[21:22:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013404 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[21:22:21] <Reedy>	 Amir has noticed it seems to be doing requests to codfw, which is odd
[21:22:59] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013404 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[21:23:17] <topranks>	 !log deleting irb.2017 interface from ssw1-a1-codfw and ssw1-a8-codfw 
[21:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:40] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:27:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:28:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T356166)', diff saved to https://phabricator.wikimedia.org/P58894 and previous config saved to /var/cache/conftool/dbconfig/20240321-212811-marostegui.json
[21:28:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[21:28:15] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[21:28:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[21:29:21] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[21:32:15] <Amir1>	 lol captcha failure rate is at 50% in API
[21:32:17] <Amir1>	 guess whyyyyyy
[21:32:22] <Amir1>	 Reedy: ^
[21:32:32] <Reedy>	 ?
[21:33:01] <Amir1>	 eqiad / codfw split I think
[21:34:11] <Amir1>	 dancy: are you deploying the revert? I want to check something
[21:34:26] <Amir1>	 let me know once done
[21:34:41] <dancy>	 rollback is in progress. I paused it before it did anything more than update wikiversions.json
[21:35:37] <Amir1>	 thanks!
[21:39:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye
[21:39:52] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9652188 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye
[21:41:40] <wikibugs>	 ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T360722 (phaultfinder) NEW
[21:42:02] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:42:09] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:43:07] <Amir1>	 I have a feeling I know what's going on, and I think the train rollback won't help, but better to wait and make sure. Once that's proven I'll try my thing
[21:44:05] <dancy>	 OK. I'm going to need to step out to pick up my son during the rollback.  
[21:44:39] <Amir1>	 can I do anything to move it over?
[21:44:40] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dns-rec and not P{dns1004*} and A:dnsbox
[21:44:54] <dancy>	 sure! You can run `scap train`!
[21:45:12] <dancy>	 I ended up cancelling the last run, so you could re-run and tell it that you want to be at group1 (option 3)
[21:45:27] <Amir1>	 awesome
[21:45:29] <dancy>	 or, at this stage, just `scap sync-wikiversions` is sufficient
[21:45:36] <Amir1>	 sure
[21:46:04] <dancy>	 Thanks!
[22:00:41] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9652267 (Volans) >>! In T345337#9652068, @bking wrote: > We're going to...
[22:02:54] <wikibugs>	 SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 8 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9652274 (Ladsgroup) >>! In T360596#9652082, @Krinkle wrote: > In MediaWiki (as deployed at WMF), there exists 1 use...
[22:05:01] <logmsgbot>	 !log ladsgroup@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided)
[22:06:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2093-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:10:52] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9652299 (bking) > The linked task is this same one. Did you meant to li...
[22:11:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2093-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:15:25] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye
[22:18:50] <wikibugs>	 (PS2) Dzahn: etherpad: switch SSL cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013146 (https://phabricator.wikimedia.org/T360413)
[22:22:31] <wikibugs>	 (CR) Dzahn: [C:+2] etherpad: switch SSL cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013146 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:27:09] <wikibugs>	 (CR) Dzahn: [C:+2] "before:" [puppet] - https://gerrit.wikimedia.org/r/1013146 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:31:23] <wikibugs>	 (PS2) Dzahn: delete etherpad.discovery.wmnet dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013394 (https://phabricator.wikimedia.org/T360413)
[22:34:51] <wikibugs>	 (CR) Dzahn: [V:+2 C:+2] delete etherpad.discovery.wmnet dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013394 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:36:56] <wikibugs>	 (CR) Dzahn: [C:+2] delete etherpad.discovery ssl cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013393 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:36:57] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9652378 (Andrew) I ran a dist-upgrade on cloudcontrol2001, 2003, 2004, 1005, 1006, 1007.
[22:39:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
[22:39:23] <stashbot>	 T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878
[22:39:35] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-a-codfw entries - cmooney@cumin1002"
[22:40:27] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old private1-a-codfw entries - cmooney@cumin1002"
[22:40:27] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:41:08] <mutante>	 !log etherpad - switching cert provider to cfssl
[22:41:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:36] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@9607731]: Add canary events generation dag in Airflow [airflow-dags/analytics@9607731b]
[22:42:05] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@9607731]: Add canary events generation dag in Airflow [airflow-dags/analytics@9607731b] (duration: 00m 29s)
[22:56:49] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[29,32,37-39,25-27,30,33,40-42].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002
[22:56:53] <stashbot>	 T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[23:09:21] <wikibugs>	 (CR) Dzahn: releases: switch SSL cert provider to cfssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[23:10:24] <wikibugs>	 (PS2) Dzahn: releases: switch SSL cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413)
[23:11:32] <wikibugs>	 (PS1) Dzahn: ssl: delete releases.discovery cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013414 (https://phabricator.wikimedia.org/T360413)
[23:11:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2092-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[23:12:13] <wikibugs>	 (PS1) Dzahn: ssl: delete aphlict.discovery ssl cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013415 (https://phabricator.wikimedia.org/T360413)
[23:13:09] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:13:15] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:14:56] <wikibugs>	 (PS1) Dzahn: aphlict: switch envoy cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013416 (https://phabricator.wikimedia.org/T360413)
[23:15:40] <wikibugs>	 (PS1) Dzahn: delete aphlict.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013417 (https://phabricator.wikimedia.org/T360413)
[23:16:11] <wikibugs>	 (PS1) Dzahn: delete releases.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013418 (https://phabricator.wikimedia.org/T360413)
[23:16:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2092-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[23:17:14] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:17:50] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002
[23:17:55] <stashbot>	 T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548
[23:18:02] <wikibugs>	 SRE, collaboration-services, Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9652460 (Dzahn)
[23:19:18] <wikibugs>	 (PS1) Dzahn: delete doc.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013419 (https://phabricator.wikimedia.org/T360413)
[23:20:23] <wikibugs>	 (PS1) Dzahn: ssl: delete doc.discovery cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013420 (https://phabricator.wikimedia.org/T360413)
[23:22:49] <wikibugs>	 (PS1) Dzahn: doc: switch envoy ssl cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013421 (https://phabricator.wikimedia.org/T360413)
[23:32:45] <wikibugs>	 SRE, ChangeProp, GitLab, Infrastructure-Foundations, and 8 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9652472 (bd808)
[23:34:52] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.42.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013422 (https://phabricator.wikimedia.org/T354441)
[23:34:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] all wikis to 1.42.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013422 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[23:35:39] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.42.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013422 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[23:46:18] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:46:25] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:49:27] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:49:34] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:50:18] <logmsgbot>	 !log reedy@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.42.0-wmf.22  refs T354441
[23:50:22] <stashbot>	 T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441
[23:52:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:53:01] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:54:33] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@582ad55]: Add params to canary events pipeline [airflow-dags/analytics@582ad55c]
[23:54:58] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@582ad55]: Add params to canary events pipeline [airflow-dags/analytics@582ad55c] (duration: 00m 25s)
[23:59:51] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:59:58] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply