[00:00:29] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1016371 (owner: TrainBranchBot)
[00:03:32] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2003.codfw.wmnet with reason: host reimage
[00:07:12] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:07:16] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:13:29] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:13:33] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:17:35] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:17:39] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:23:15] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:23:19] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:25:46] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:25:50] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:25:56] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2003.codfw.wmnet with OS bookworm
[00:30:04] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:30:08] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:36:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:37:00] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:43:35] <logmsgbot>	 !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logging-hd2002.codfw.wmnet with OS bookworm
[00:44:05] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:44:09] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:46:07] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2002.codfw.wmnet with OS bookworm
[01:00:22] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:00:26] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:05:31] <wikibugs>	 (CR) Tim Starling: "Amir says" [puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[01:06:14] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:06:18] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:09:20] <wikibugs>	 (PS1) TChin: [WIP] Add datasets-config helm chart and helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434)
[01:10:00] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] Add datasets-config helm chart and helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: TChin)
[01:15:59] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd2002.codfw.wmnet with reason: host reimage
[01:19:03] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2002.codfw.wmnet with reason: host reimage
[01:40:39] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2002.codfw.wmnet with OS bookworm
[01:44:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 834.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:49:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 869.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:51:23] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:51:27] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:58:46] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:58:50] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:32:18] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:37:22] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:58] <wikibugs>	 (PS1) Krinkle: codesearch: Set CODESEARCH_HOUND_BASE for codesearch-frontend [puppet] - https://gerrit.wikimedia.org/r/1016480
[02:45:04] <wikibugs>	 (PS2) Krinkle: codesearch: Set CODESEARCH_HOUND_BASE for codesearch-frontend [puppet] - https://gerrit.wikimedia.org/r/1016480
[02:45:07] <wikibugs>	 (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1016480 (owner: Krinkle)
[02:50:41] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:02:22] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:53:48] <wikibugs>	 (PS3) Krinkle: [WIP] codesearch: Set CODESEARCH_HOUND_BASE for codesearch-frontend [puppet] - https://gerrit.wikimedia.org/r/1016480
[04:03:44] <wikibugs>	 (CR) Krinkle: "Based on the below test, I believe this would not work currently." [puppet] - https://gerrit.wikimedia.org/r/1016480 (owner: Krinkle)
[04:32:03] <wikibugs>	 (PS2) Tim Starling: WMCS: Read from the new block/block_target tables [puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034)
[04:32:03] <wikibugs>	 (CR) Tim Starling: "I tested it locally using I0d9afa97a4566e9c9fd8cd812b5fcb8698eaf4f9. Now I'm moderately confident and ready for it to be merged." [puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[04:35:46] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665 (AndyRussG) NEW
[04:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:09:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[05:10:05] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[05:10:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:10:22] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:10:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T356166)', diff saved to https://phabricator.wikimedia.org/P59238 and previous config saved to /var/cache/conftool/dbconfig/20240403-051029-marostegui.json
[05:10:32] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[05:11:21] <wikibugs>	 (PS1) Marostegui: db1222: Upgrade to Bookworm and MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1016489 (https://phabricator.wikimedia.org/T361543)
[05:11:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1222 T361543', diff saved to https://phabricator.wikimedia.org/P59239 and previous config saved to /var/cache/conftool/dbconfig/20240403-051149-root.json
[05:11:53] <stashbot>	 T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543
[05:12:26] <wikibugs>	 (CR) Marostegui: [C:+2] db1222: Upgrade to Bookworm and MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1016489 (https://phabricator.wikimedia.org/T361543) (owner: Marostegui)
[05:13:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1222.eqiad.wmnet with OS bookworm
[05:16:13] <wikibugs>	 (PS1) Marostegui: Revert "db1222: Upgrade to Bookworm and MariaDB 10.6" [puppet] - https://gerrit.wikimedia.org/r/1016506
[05:25:52] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage
[05:28:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage
[05:31:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:42:52] <wikibugs>	 (PS1) Marostegui: db2148: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1016491
[05:43:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2148 T361543', diff saved to https://phabricator.wikimedia.org/P59240 and previous config saved to /var/cache/conftool/dbconfig/20240403-054310-root.json
[05:43:14] <stashbot>	 T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543
[05:43:51] <wikibugs>	 (CR) Marostegui: [C:+2] db2148: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1016491 (owner: Marostegui)
[05:44:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2148.codfw.wmnet with OS bookworm
[05:46:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59241 and previous config saved to /var/cache/conftool/dbconfig/20240403-054641-root.json
[05:47:02] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1222: Upgrade to Bookworm and MariaDB 10.6" [puppet] - https://gerrit.wikimedia.org/r/1016506 (owner: Marostegui)
[05:48:21] <wikibugs>	 (PS1) Marostegui: Revert "db2148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1016507
[05:49:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1222.eqiad.wmnet with OS bookworm
[05:50:15] <wikibugs>	 (PS1) Marostegui: installserver: Do not format es2037 [puppet] - https://gerrit.wikimedia.org/r/1016492
[05:50:59] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:51:03] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:53:23] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not format es2037 [puppet] - https://gerrit.wikimedia.org/r/1016492 (owner: Marostegui)
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T0600)
[06:01:10] <hashar>	 jouncebot: next
[06:01:10] <jouncebot>	 In 0 hour(s) and 58 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T0700)
[06:01:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2148.codfw.wmnet with reason: host reimage
[06:01:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59242 and previous config saved to /var/cache/conftool/dbconfig/20240403-060147-root.json
[06:04:21] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:22] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2148.codfw.wmnet with reason: host reimage
[06:05:26] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:09:21] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:10:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T356166)', diff saved to https://phabricator.wikimedia.org/P59243 and previous config saved to /var/cache/conftool/dbconfig/20240403-061055-marostegui.json
[06:11:00] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:13:38] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:13:42] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:16:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59244 and previous config saved to /var/cache/conftool/dbconfig/20240403-061653-root.json
[06:23:52] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:23:57] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:24:11] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1016507 (owner: Marostegui)
[06:24:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59245 and previous config saved to /var/cache/conftool/dbconfig/20240403-062436-root.json
[06:25:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2148.codfw.wmnet with OS bookworm
[06:26:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P59246 and previous config saved to /var/cache/conftool/dbconfig/20240403-062602-marostegui.json
[06:31:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59247 and previous config saved to /var/cache/conftool/dbconfig/20240403-063159-root.json
[06:32:18] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:39:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59248 and previous config saved to /var/cache/conftool/dbconfig/20240403-063941-root.json
[06:41:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P59249 and previous config saved to /var/cache/conftool/dbconfig/20240403-064110-marostegui.json
[06:47:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59250 and previous config saved to /var/cache/conftool/dbconfig/20240403-064704-root.json
[06:54:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59251 and previous config saved to /var/cache/conftool/dbconfig/20240403-065447-root.json
[06:56:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T356166)', diff saved to https://phabricator.wikimedia.org/P59252 and previous config saved to /var/cache/conftool/dbconfig/20240403-065617-marostegui.json
[06:56:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[06:56:20] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:56:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[06:56:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[06:56:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[06:57:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T356166)', diff saved to https://phabricator.wikimedia.org/P59253 and previous config saved to /var/cache/conftool/dbconfig/20240403-065706-marostegui.json
[06:59:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T356166)', diff saved to https://phabricator.wikimedia.org/P59254 and previous config saved to /var/cache/conftool/dbconfig/20240403-065923-marostegui.json
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59255 and previous config saved to /var/cache/conftool/dbconfig/20240403-070212-root.json
[07:02:26] <jinxer-wm>	 (RoutinatorRRDPErrors) firing: Routinator RRDP fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RRDP_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRRDPErrors
[07:09:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2125 T361543', diff saved to https://phabricator.wikimedia.org/P59256 and previous config saved to /var/cache/conftool/dbconfig/20240403-070946-root.json
[07:09:50] <stashbot>	 T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543
[07:09:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59257 and previous config saved to /var/cache/conftool/dbconfig/20240403-070953-root.json
[07:10:42] <wikibugs>	 (PS1) Marostegui: db2125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1016620
[07:11:20] <wikibugs>	 (CR) Marostegui: [C:+2] db2125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1016620 (owner: Marostegui)
[07:11:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2125.codfw.wmnet with OS bookworm
[07:11:47] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2202.codfw.wmnet with OS bookworm
[07:14:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59258 and previous config saved to /var/cache/conftool/dbconfig/20240403-071431-marostegui.json
[07:16:53] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: removes db2100 after memory failure [puppet] - https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) (owner: Arnaudb)
[07:17:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59259 and previous config saved to /var/cache/conftool/dbconfig/20240403-071718-root.json
[07:18:04] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2100.codfw.wmnet
[07:20:46] <wikibugs>	 (PS1) Slyngshede: SSH keys: Provide feedback on actions. [software/bitu] - https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966)
[07:22:07] <wikibugs>	 (CR) Slyngshede: "I apparently lost the patch for adding messages to the user on key operations, so I had to redo it." [software/bitu] - https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966) (owner: Slyngshede)
[07:22:26] <jinxer-wm>	 (RoutinatorRRDPErrors) firing: (2) Routinator RRDP fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RRDP_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRRDPErrors
[07:24:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59260 and previous config saved to /var/cache/conftool/dbconfig/20240403-072459-root.json
[07:25:26] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:26:40] <wikibugs>	 (PS1) Muehlenhoff: Record updated contract end for rkhan [puppet] - https://gerrit.wikimedia.org/r/1016702 (https://phabricator.wikimedia.org/T361527)
[07:27:26] <jinxer-wm>	 (RoutinatorRRDPErrors) firing: (2) Routinator RRDP fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RRDP_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRRDPErrors
[07:27:46] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2202.codfw.wmnet with reason: host reimage
[07:28:26] <wikibugs>	 (CR) Ryan Kemper: [C:-1] "Given some context I'm seeing in the tickets (about spicerack using curator; I don't yet understand why), this feels like a risky change. " [puppet] - https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: Ryan Kemper)
[07:28:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2125.codfw.wmnet with reason: host reimage
[07:29:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59261 and previous config saved to /var/cache/conftool/dbconfig/20240403-072938-marostegui.json
[07:30:04] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Record updated contract end for rkhan [puppet] - https://gerrit.wikimedia.org/r/1016702 (https://phabricator.wikimedia.org/T361527) (owner: Muehlenhoff)
[07:32:26] <jinxer-wm>	 (RoutinatorRRDPErrors) resolved: (2) Routinator RRDP fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RRDP_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRRDPErrors
[07:32:29] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2202.codfw.wmnet with reason: host reimage
[07:35:51] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2125.codfw.wmnet with reason: host reimage
[07:37:16] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966) (owner: Slyngshede)
[07:37:45] <wikibugs>	 (CR) Slyngshede: [C:+2] SSH keys: Provide feedback on actions. [software/bitu] - https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966) (owner: Slyngshede)
[07:39:00] <wikibugs>	 (Merged) jenkins-bot: SSH keys: Provide feedback on actions. [software/bitu] - https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966) (owner: Slyngshede)
[07:40:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59262 and previous config saved to /var/cache/conftool/dbconfig/20240403-074004-root.json
[07:44:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T356166)', diff saved to https://phabricator.wikimedia.org/P59263 and previous config saved to /var/cache/conftool/dbconfig/20240403-074446-marostegui.json
[07:44:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[07:44:50] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[07:45:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[07:45:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T356166)', diff saved to https://phabricator.wikimedia.org/P59264 and previous config saved to /var/cache/conftool/dbconfig/20240403-074509-marostegui.json
[07:47:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T356166)', diff saved to https://phabricator.wikimedia.org/P59265 and previous config saved to /var/cache/conftool/dbconfig/20240403-074727-marostegui.json
[07:53:40] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2202.codfw.wmnet with OS bookworm
[07:55:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59266 and previous config saved to /var/cache/conftool/dbconfig/20240403-075510-root.json
[07:56:24] <wikibugs>	 (PS1) Majavah: hieradata: Upgrade clouddb2002-dev to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1016705
[07:56:28] <wikibugs>	 (PS1) Marostegui: common.yaml: Add cu_useragent to private tables [puppet] - https://gerrit.wikimedia.org/r/1016706 (https://phabricator.wikimedia.org/T361673)
[07:58:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2125.codfw.wmnet with OS bookworm
[07:59:08] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1789/console" [puppet] - https://gerrit.wikimedia.org/r/1016705 (owner: Majavah)
[08:00:04] <jouncebot>	 jnuche and jeena: Time to do the MediaWiki train - Utc-0+Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T0800).
[08:00:25] <jnuche>	 morning, train deploy in a few minutes
[08:00:47] <wikibugs>	 (PS2) Majavah: Upgrade clouddb2002-dev to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1016705 (https://phabricator.wikimedia.org/T361666)
[08:01:13] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox
[08:02:02] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P59267 and previous config saved to /var/cache/conftool/dbconfig/20240403-080235-marostegui.json
[08:04:32] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016707 (https://phabricator.wikimedia.org/T360157)
[08:04:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016707 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[08:04:58] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1790/co" [puppet] - https://gerrit.wikimedia.org/r/1016705 (https://phabricator.wikimedia.org/T361666) (owner: Majavah)
[08:05:32] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016707 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[08:05:39] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] Upgrade clouddb2002-dev to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1016705 (https://phabricator.wikimedia.org/T361666) (owner: Majavah)
[08:07:02] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:07:06] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2125: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1016512
[08:09:16] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2100.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002"
[08:10:20] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2100.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002"
[08:10:20] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:10:21] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2100.codfw.wmnet
[08:10:45] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] common.yaml: Add cu_useragent to private tables [puppet] - 10https://gerrit.wikimedia.org/r/1016706 (https://phabricator.wikimedia.org/T361673) (owner: 10Marostegui)
[08:11:26] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] common.yaml: Add cu_useragent to private tables [puppet] - 10https://gerrit.wikimedia.org/r/1016706 (https://phabricator.wikimedia.org/T361673) (owner: 10Marostegui)
[08:11:40] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db2125: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1016512 (owner: 10Marostegui)
[08:11:47] <wikibugs>	 10ops-codfw, 06DBA, 10decommission-hardware, 13Patch-For-Review: decommission db2100.codfw.wmnet - https://phabricator.wikimedia.org/T361584#9683193 (10ABran-WMF) 05In progress→03Open a:05ABran-WMF→03None
[08:12:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59268 and previous config saved to /var/cache/conftool/dbconfig/20240403-081207-root.json
[08:12:22] <wikibugs>	 (03PS1) 10Ayounsi: Routed Ganeti: fix v6 route install [puppet] - 10https://gerrit.wikimedia.org/r/1016708 (https://phabricator.wikimedia.org/T300152)
[08:14:16] <wikibugs>	 (03PS2) 10Ayounsi: Routed Ganeti: fix v6 route install [puppet] - 10https://gerrit.wikimedia.org/r/1016708 (https://phabricator.wikimedia.org/T300152)
[08:14:33] <godog>	 jouncebot: next
[08:14:33] <jouncebot>	 In 1 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1000)
[08:15:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: bump ops prometheus retention_size [puppet] - 10https://gerrit.wikimedia.org/r/1016304 (https://phabricator.wikimedia.org/T360537) (owner: 10Filippo Giunchedi)
[08:15:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster1002.eqiad.wmnet
[08:16:45] <godog>	 !log roll-restart prometheus/ops in codfw/eqiad to apply new retention settings - T360537
[08:17:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P59269 and previous config saved to /var/cache/conftool/dbconfig/20240403-081742-marostegui.json
[08:19:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:20:00] <jinxer-wm>	 (ProbeDown) firing: (2) Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:20:24] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.25  refs T360157
[08:21:54] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, typo inline" [puppet] - 10https://gerrit.wikimedia.org/r/1016708 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[08:23:55] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cp3067: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1015969 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh)
[08:24:12] <fabfur>	 !log depool cp3067 for reimage (T360430)
[08:24:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:15] <stashbot>	 T360430: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430
[08:24:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:24:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:24:25] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: fix v6 route install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016708 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[08:25:42] <fabfur>	 XioNoX: I have a change ready to be merged on puppetmaster, along with yours
[08:25:43] <wikibugs>	 (03PS1) 10Majavah: Bind mariadb on clouddb2002-dev to the IPv4 address [puppet] - 10https://gerrit.wikimedia.org/r/1016712
[08:25:46] <fabfur>	 is that ok for you?
[08:25:54] <XioNoX>	 fabfur: yep
[08:25:55] <XioNoX>	 thx
[08:26:07] <fabfur>	 ahaha sorry, switching from one channel to another
[08:26:09] <fabfur>	 I'll go
[08:27:03] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3067.esams.wmnet
[08:27:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59270 and previous config saved to /var/cache/conftool/dbconfig/20240403-082712-root.json
[08:28:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:29:44] <wikibugs>	 (03CR) 10Majavah: [C:03+2] Bind mariadb on clouddb2002-dev to the IPv4 address [puppet] - 10https://gerrit.wikimedia.org/r/1016712 (owner: 10Majavah)
[08:29:46] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3067.esams.wmnet with OS bullseye
[08:29:56] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9683295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3067.esams.wmnet with OS bullseye
[08:30:00] <jinxer-wm>	 (ProbeDown) resolved: (2) Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:30:45] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[08:30:58] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[08:31:00] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:31:16] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:31:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T360332)', diff saved to https://phabricator.wikimedia.org/P59271 and previous config saved to /var/cache/conftool/dbconfig/20240403-083123-arnaudb.json
[08:31:29] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[08:31:42] <wikibugs>	 (03PS1) 10Slyngshede: Add svg files to packages. [software/bitu] - 10https://gerrit.wikimedia.org/r/1016713
[08:32:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T356166)', diff saved to https://phabricator.wikimedia.org/P59272 and previous config saved to /var/cache/conftool/dbconfig/20240403-083249-marostegui.json
[08:32:52] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[08:32:53] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[08:33:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:33:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:33:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster1002.eqiad.wmnet
[08:33:06] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[08:33:10] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Decommission puppetmaster1002 - https://phabricator.wikimedia.org/T357093#9683311 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetmaster1002.eqiad.wmnet` - puppetmaster10...
[08:33:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T356166)', diff saved to https://phabricator.wikimedia.org/P59273 and previous config saved to /var/cache/conftool/dbconfig/20240403-083313-marostegui.json
[08:33:25] <logmsgbot>	 !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.42.0-wmf.25  refs T360157 (duration: 13m 00s)
[08:33:28] <stashbot>	 T360157: 1.42.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T360157
[08:33:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T360332)', diff saved to https://phabricator.wikimedia.org/P59274 and previous config saved to /var/cache/conftool/dbconfig/20240403-083343-arnaudb.json
[08:35:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1016713 (owner: 10Slyngshede)
[08:35:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T356166)', diff saved to https://phabricator.wikimedia.org/P59275 and previous config saved to /var/cache/conftool/dbconfig/20240403-083530-marostegui.json
[08:35:46] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Add svg files to packages. [software/bitu] - 10https://gerrit.wikimedia.org/r/1016713 (owner: 10Slyngshede)
[08:36:18] <marostegui>	 !log stop sanitarium codfw hosts T361673
[08:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:20] <stashbot>	 T361673: Filter cu_useragent on sanitarium - https://phabricator.wikimedia.org/T361673
[08:36:54] <wikibugs>	 (03Merged) 10jenkins-bot: Add svg files to packages. [software/bitu] - 10https://gerrit.wikimedia.org/r/1016713 (owner: 10Slyngshede)
[08:40:39] <wikibugs>	 (03PS1) 10Ayounsi: Add routed Ganeti to Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1016714 (https://phabricator.wikimedia.org/T300152)
[08:42:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59276 and previous config saved to /var/cache/conftool/dbconfig/20240403-084218-root.json
[08:48:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P59278 and previous config saved to /var/cache/conftool/dbconfig/20240403-084851-arnaudb.json
[08:50:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59279 and previous config saved to /var/cache/conftool/dbconfig/20240403-085037-marostegui.json
[08:51:25] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:42] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3067.esams.wmnet with reason: host reimage
[08:52:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1016714 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[08:55:12] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add routed Ganeti to Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1016714 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[08:55:58] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3067.esams.wmnet with reason: host reimage
[08:56:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59280 and previous config saved to /var/cache/conftool/dbconfig/20240403-085723-root.json
[09:00:10] <slyngs>	 !log Upgraded Bitu / idm.wikimedia.org to version 0.0.6-2
[09:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:03:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:03:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P59281 and previous config saved to /var/cache/conftool/dbconfig/20240403-090358-arnaudb.json
[09:04:00] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:05:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59282 and previous config saved to /var/cache/conftool/dbconfig/20240403-090545-marostegui.json
[09:06:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59283 and previous config saved to /var/cache/conftool/dbconfig/20240403-091229-root.json
[09:12:44] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1016372 (https://phabricator.wikimedia.org/T361682)
[09:13:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): 14Decommission puppetmaster1002 - 14https://phabricator.wikimedia.org/T357093#9683402 (10MoritzMuehlenhoff) 05Open→03Resolved 14puppetmaster1002 has been decommissioned.
[09:14:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: 14Connection errors from puppetmaster1002 to puppetdb - 14https://phabricator.wikimedia.org/T358187#9683417 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff 14We never got to the bottom of this error, it was likely a hardwa...
[09:17:17] <wikibugs>	 10ops-eqiad, 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Decommission puppetmaster1002 - https://phabricator.wikimedia.org/T357093#9683422 (10MoritzMuehlenhoff) 05Resolved→03Open a:05MoritzMuehlenhoff→03Jclark-ctr
[09:18:30] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove puppetmaster1002 from puppetdb ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1016716 (https://phabricator.wikimedia.org/T357093)
[09:18:56] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:19:00] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:19:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T360332)', diff saved to https://phabricator.wikimedia.org/P59284 and previous config saved to /var/cache/conftool/dbconfig/20240403-091906-arnaudb.json
[09:19:09] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[09:19:09] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[09:19:22] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[09:19:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T360332)', diff saved to https://phabricator.wikimedia.org/P59285 and previous config saved to /var/cache/conftool/dbconfig/20240403-091929-arnaudb.json
[09:19:45] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3067.esams.wmnet with OS bullseye
[09:19:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove puppetmaster1002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1016717 (https://phabricator.wikimedia.org/T357093)
[09:19:56] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9683438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3067.esams.wmnet with OS bullseye completed: - cp3067 (**PASS**)...
[09:20:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T356166)', diff saved to https://phabricator.wikimedia.org/P59286 and previous config saved to /var/cache/conftool/dbconfig/20240403-092053-marostegui.json
[09:20:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[09:20:56] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[09:21:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[09:21:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T356166)', diff saved to https://phabricator.wikimedia.org/P59287 and previous config saved to /var/cache/conftool/dbconfig/20240403-092116-marostegui.json
[09:21:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster1002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1016717 (https://phabricator.wikimedia.org/T357093) (owner: 10Muehlenhoff)
[09:21:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T360332)', diff saved to https://phabricator.wikimedia.org/P59288 and previous config saved to /var/cache/conftool/dbconfig/20240403-092149-arnaudb.json
[09:21:51] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: 14Investigate Ganeti in routed mode - 14https://phabricator.wikimedia.org/T300152#9683434 (10ayounsi) 05Open→03Resolved 14We can consider this task completed with success.  Next step is to discuss the next steps and ope...
[09:23:09] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1 C:03+2] k8s/apiserver: Add option to configure audit logging [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[09:23:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T356166)', diff saved to https://phabricator.wikimedia.org/P59289 and previous config saved to /var/cache/conftool/dbconfig/20240403-092334-marostegui.json
[09:24:07] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9683452 (10Fabfur)
[09:24:19] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3067.esams.wmnet
[09:27:12] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1037.eqiad.wmnet with OS bookworm
[09:27:25] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683460 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1037.eqiad.wmnet...
[09:27:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59290 and previous config saved to /var/cache/conftool/dbconfig/20240403-092735-root.json
[09:27:37] <marostegui>	 !log Restart sanitarium db1155 T361673
[09:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:40] <stashbot>	 T361673: Filter cu_useragent on sanitarium - https://phabricator.wikimedia.org/T361673
[09:27:43] <wikibugs>	 (03PS1) 10David Caro: containerd: export the crictl endpoint in profile.d [puppet] - 10https://gerrit.wikimedia.org/r/1016719
[09:28:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1037: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016720 (https://phabricator.wikimedia.org/T319184)
[09:29:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1016719 (owner: 10David Caro)
[09:30:20] <wikibugs>	 (03CR) 10David Caro: [C:03+1] cloudvirt1037: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016720 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[09:31:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:36] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:31:40] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:31:57] <wikibugs>	 (03PS1) 10JMeybohm: k8s/apiserver: Fix parameter syntax for --audit-log-maxsize [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507)
[09:32:04] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507)
[09:32:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] k8s/apiserver: Fix parameter syntax for --audit-log-maxsize [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[09:32:46] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudvirt1037: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016720 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[09:32:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos)
[09:32:55] <wikibugs>	 (03PS2) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507)
[09:33:52] <wikibugs>	 (03PS2) 10JMeybohm: k8s/apiserver: Fix parameter syntax for --audit-log-maxsize [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507)
[09:33:56] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[09:34:36] <wikibugs>	 (03PS1) 10Muehlenhoff: debmonitor: Remove obsolete discovery certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016723 (https://phabricator.wikimedia.org/T357750)
[09:34:50] <wikibugs>	 (03PS1) 10Slavina Stefanova: harbor: upgrade from 2.9.0 to 2.10.1 [puppet] - 10https://gerrit.wikimedia.org/r/1016724 (https://phabricator.wikimedia.org/T354507)
[09:34:55] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] k8s/apiserver: Fix parameter syntax for --audit-log-maxsize [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[09:35:50] <wikibugs>	 (03PS2) 10Muehlenhoff: debmonitor: Remove obsolete discovery certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016723 (https://phabricator.wikimedia.org/T357750)
[09:36:18] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016723 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[09:36:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P59291 and previous config saved to /var/cache/conftool/dbconfig/20240403-093657-arnaudb.json
[09:37:34] <wikibugs>	 (03PS3) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507)
[09:38:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P59292 and previous config saved to /var/cache/conftool/dbconfig/20240403-093842-marostegui.json
[09:38:47] <godog>	 jouncebot: next
[09:38:48] <jouncebot>	 In 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1000)
[09:39:30] <godog>	 !log roll-restart prometheus/k8s in codfw/eqiad to apply new retention settings - T360537
[09:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:33] <stashbot>	 T360537: Bump prometheus instances allocated space - https://phabricator.wikimedia.org/T360537
[09:39:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: bump k8s prometheus retention_size [puppet] - 10https://gerrit.wikimedia.org/r/1016305 (https://phabricator.wikimedia.org/T360537) (owner: 10Filippo Giunchedi)
[09:40:20] <wikibugs>	 (03PS4) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507)
[09:41:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:42:35] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1037
[09:42:39] <wikibugs>	 (03PS5) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507)
[09:42:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59293 and previous config saved to /var/cache/conftool/dbconfig/20240403-094241-root.json
[09:43:00] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1037
[09:44:05] <wikibugs>	 (03CR) 10Majavah: [C:03+2] php82-sssd: add php-yaml [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: 10Krinkle)
[09:44:40] <Dreamy_Jazz>	 jouncebot: next
[09:44:40] <jouncebot>	 In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1000)
[09:44:43] <wikibugs>	 (03Merged) 10jenkins-bot: php82-sssd: add php-yaml [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: 10Krinkle)
[09:44:51] <wikibugs>	 (03PS5) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614
[09:45:13] <Dreamy_Jazz>	 !log Doing security deploy for T361293
[09:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:31] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage
[09:48:15] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage
[09:48:30] <wikibugs>	 (03PS6) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507)
[09:49:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove dummy cert for debmonitor [labs/private] - 10https://gerrit.wikimedia.org/r/1016726 (https://phabricator.wikimedia.org/T357750)
[09:50:16] <wikibugs>	 (03PS1) 10Mvolz: Update zotero to node18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016728 (https://phabricator.wikimedia.org/T349118)
[09:51:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:55] <wikibugs>	 (03CR) 10David Caro: [C:03+2] containerd: export the crictl endpoint in profile.d [puppet] - 10https://gerrit.wikimedia.org/r/1016719 (owner: 10David Caro)
[09:52:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P59294 and previous config saved to /var/cache/conftool/dbconfig/20240403-095204-arnaudb.json
[09:52:27] <wikibugs>	 (CR) David Caro: [C:+2] "Tested in tools" [puppet] - https://gerrit.wikimedia.org/r/1016719 (owner: David Caro)
[09:53:21] <wikibugs>	 (CR) Jgiannelos: "This is the missing config section to enable caching in PCS staging. From the CI output it looks like the template generates whats expecte" [deployment-charts] - https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: Jgiannelos)
[09:53:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P59295 and previous config saved to /var/cache/conftool/dbconfig/20240403-095349-marostegui.json
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1000)
[10:06:37] <logmsgbot>	 !log dreamyjazz Deployed security patch for T361293
[10:07:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T360332)', diff saved to https://phabricator.wikimedia.org/P59296 and previous config saved to /var/cache/conftool/dbconfig/20240403-100712-arnaudb.json
[10:07:15] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[10:07:15] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[10:07:28] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[10:07:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59297 and previous config saved to /var/cache/conftool/dbconfig/20240403-100735-arnaudb.json
[10:08:29] <wikibugs>	 (CR) Hnowlan: [C:+1] "lgtm - it might be nice to add a .fixture entry to show this feature being enabled for testing purposes." [deployment-charts] - https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: Jgiannelos)
[10:08:31] <wikibugs>	 (PS2) Muehlenhoff: analytics_cluster::coordinator: Configure Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619)
[10:08:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T356166)', diff saved to https://phabricator.wikimedia.org/P59298 and previous config saved to /var/cache/conftool/dbconfig/20240403-100857-marostegui.json
[10:08:58] <moritzm>	 !log installing util-linux security updates
[10:08:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[10:09:00] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[10:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[10:09:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T356166)', diff saved to https://phabricator.wikimedia.org/P59299 and previous config saved to /var/cache/conftool/dbconfig/20240403-100919-marostegui.json
[10:10:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59300 and previous config saved to /var/cache/conftool/dbconfig/20240403-100959-arnaudb.json
[10:10:29] <marostegui>	 !log Restart sanitarium db1154 T361673
[10:10:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:32] <stashbot>	 T361673: Filter cu_useragent on sanitarium - https://phabricator.wikimedia.org/T361673
[10:11:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T356166)', diff saved to https://phabricator.wikimedia.org/P59301 and previous config saved to /var/cache/conftool/dbconfig/20240403-101137-marostegui.json
[10:14:04] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:14:15] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:14:27] <wikibugs>	 (CR) Volans: "Code looks sane, I would love to see it in action, but if you tested in your lab that's enough for me. One question and minor nits/suggest" [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: Scott French)
[10:17:19] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1037.eqiad.wmnet with OS bookworm
[10:17:34] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683638 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1037.eqiad.wmnet with...
[10:18:49] <wikibugs>	 (PS7) Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507)
[10:19:15] <wikibugs>	 (PS8) Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507)
[10:19:51] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, better explicit than implicit and we could split it" [puppet] - https://gerrit.wikimedia.org/r/1016456 (owner: Scott French)
[10:20:00] <logmsgbot>	 !log dreamyjazz Deployed security patch for T361293
[10:20:40] <wikibugs>	 (PS9) Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507)
[10:24:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 841.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:25:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P59302 and previous config saved to /var/cache/conftool/dbconfig/20240403-102507-arnaudb.json
[10:25:21] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683651 (aborrero)
[10:25:30] <wikibugs>	 (CR) Hnowlan: [C:+1] mobileapps: Caching config for pregenerated content [deployment-charts] - https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: Jgiannelos)
[10:26:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P59303 and previous config saved to /var/cache/conftool/dbconfig/20240403-102644-marostegui.json
[10:27:15] <wikibugs>	 (CR) Jgiannelos: [C:+2] mobileapps: Caching config for pregenerated content [deployment-charts] - https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: Jgiannelos)
[10:28:07] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Caching config for pregenerated content [deployment-charts] - https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: Jgiannelos)
[10:29:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 813.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:29:41] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[10:29:45] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[10:35:41] <wikibugs>	 (PS7) Stevemunene: Decommission an-coord100[12] The change includes removal of an-coord100[1-2] mentions in comments and references. [puppet] - https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: Brouberol)
[10:37:43] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: Brouberol)
[10:38:41] <wikibugs>	 (CR) CI reject: [V:-1] Decommission an-coord100[12] The change includes removal of an-coord100[1-2] mentions in comments and references. [puppet] - https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: Brouberol)
[10:38:44] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1016457 (owner: Scott French)
[10:39:40] <wikibugs>	 (CR) Volans: [C:+1] "Nice! We could also keep it for the migration." [puppet] - https://gerrit.wikimedia.org/r/1016458 (owner: Scott French)
[10:40:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P59304 and previous config saved to /var/cache/conftool/dbconfig/20240403-104014-arnaudb.json
[10:40:42] <wikibugs>	 (PS1) Stevemunene: Decommission an-coord100[12] [puppet] - https://gerrit.wikimedia.org/r/1016741 (https://phabricator.wikimedia.org/T353774)
[10:41:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P59305 and previous config saved to /var/cache/conftool/dbconfig/20240403-104152-marostegui.json
[10:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 837.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:47:03] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1038.eqiad.wmnet with OS bookworm
[10:47:37] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[10:50:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 837.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:55:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59306 and previous config saved to /var/cache/conftool/dbconfig/20240403-105522-arnaudb.json
[10:55:24] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[10:55:31] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[10:55:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[10:55:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T360332)', diff saved to https://phabricator.wikimedia.org/P59307 and previous config saved to /var/cache/conftool/dbconfig/20240403-105545-arnaudb.json
[10:57:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T356166)', diff saved to https://phabricator.wikimedia.org/P59308 and previous config saved to /var/cache/conftool/dbconfig/20240403-105659-marostegui.json
[10:57:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[10:57:05] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[10:57:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[10:57:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T356166)', diff saved to https://phabricator.wikimedia.org/P59309 and previous config saved to /var/cache/conftool/dbconfig/20240403-105722-marostegui.json
[10:57:45] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[10:57:57] <wikibugs>	 (CR) CI reject: [V:-1] Decommission an-coord100[12] [puppet] - https://gerrit.wikimedia.org/r/1016741 (https://phabricator.wikimedia.org/T353774) (owner: Stevemunene)
[10:58:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T360332)', diff saved to https://phabricator.wikimedia.org/P59310 and previous config saved to /var/cache/conftool/dbconfig/20240403-105804-arnaudb.json
[10:58:46] <wikibugs>	 (Abandoned) Stevemunene: Decommission an-coord100[12] [puppet] - https://gerrit.wikimedia.org/r/1016741 (https://phabricator.wikimedia.org/T353774) (owner: Stevemunene)
[10:59:01] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683722 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1038.eqiad.wmnet...
[10:59:24] <wikibugs>	 (PS8) Stevemunene: Decommission an-coord100[12] [puppet] - https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: Brouberol)
[10:59:28] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt1037: move to modern single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1016743 (https://phabricator.wikimedia.org/T319184)
[10:59:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T356166)', diff saved to https://phabricator.wikimedia.org/P59311 and previous config saved to /var/cache/conftool/dbconfig/20240403-105940-marostegui.json
[11:00:05] <jouncebot>	 mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1100).
[11:00:48] <wikibugs>	 (CR) Majavah: [C:+1] cloudvirt1037: move to modern single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1016743 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:01:26] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] cloudvirt1037: move to modern single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1016743 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:01:53] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] "Updating my +1 after testing the latest changes!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) (owner: Elukey)
[11:02:55] <wikibugs>	 (CR) Brouberol: [C:+1] Remove now obsolete site.pp entry [puppet] - https://gerrit.wikimedia.org/r/1016293 (https://phabricator.wikimedia.org/T341895) (owner: Muehlenhoff)
[11:04:01] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage
[11:05:12] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[11:05:14] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[11:06:26] <wikibugs>	 (CR) Mvolz: [C:+2] Update zotero to node18 [deployment-charts] - https://gerrit.wikimedia.org/r/1016728 (https://phabricator.wikimedia.org/T349118) (owner: Mvolz)
[11:07:07] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage
[11:07:33] <wikibugs>	 (Merged) jenkins-bot: Update zotero to node18 [deployment-charts] - https://gerrit.wikimedia.org/r/1016728 (https://phabricator.wikimedia.org/T349118) (owner: Mvolz)
[11:08:01] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[11:08:03] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[11:09:51] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@75163c7]: (no justification provided)
[11:10:23] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@75163c7]: (no justification provided) (duration: 00m 32s)
[11:11:25] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[11:11:50] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[11:13:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P59312 and previous config saved to /var/cache/conftool/dbconfig/20240403-111312-arnaudb.json
[11:13:27] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[11:13:59] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove now obsolete site.pp entry [puppet] - https://gerrit.wikimedia.org/r/1016293 (https://phabricator.wikimedia.org/T341895) (owner: Muehlenhoff)
[11:14:07] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[11:14:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P59313 and previous config saved to /var/cache/conftool/dbconfig/20240403-111447-marostegui.json
[11:15:30] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[11:16:03] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[11:16:07] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1038
[11:16:31] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1038
[11:17:54] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1008944 (owner: Scott French)
[11:19:30] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1016374
[11:19:54] <wikibugs>	 (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1016374 (owner: PipelineBot)
[11:20:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:21:00] <wikibugs>	 (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1016374 (owner: PipelineBot)
[11:22:46] <wikibugs>	 (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1015457 (owner: PipelineBot)
[11:23:45] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[11:23:55] <wikibugs>	 (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1015457 (owner: PipelineBot)
[11:23:57] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[11:24:04] <wikibugs>	 (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1014058 (owner: PipelineBot)
[11:24:23] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[11:25:14] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[11:25:26] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:27:09] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[11:27:33] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:28:06] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:28:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P59314 and previous config saved to /var/cache/conftool/dbconfig/20240403-112819-arnaudb.json
[11:28:43] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:29:15] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:29:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P59315 and previous config saved to /var/cache/conftool/dbconfig/20240403-112955-marostegui.json
[11:30:02] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:30:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:33:03] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1038.eqiad.wmnet with OS bookworm
[11:35:17] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683890 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1038.eqiad.wmnet with...
[11:35:31] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683893 (aborrero)
[11:35:54] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] ores: Remove old ORES DNS entries [dns] - https://gerrit.wikimedia.org/r/1016389 (owner: Alexandros Kosiaris)
[11:36:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] "Thanks for the +1" [dns] - https://gerrit.wikimedia.org/r/1016389 (owner: Alexandros Kosiaris)
[11:37:14] <wikibugs>	 (CR) Majavah: [C:+1] "typo inline, otherwise LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: Volans)
[11:38:17] <wikibugs>	 (PS2) Volans: puppet: PuppetServer.destroy improvement [software/spicerack] - https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293)
[11:38:56] <wikibugs>	 (CR) Volans: "fixed typo" [software/spicerack] - https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: Volans)
[11:43:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T360332)', diff saved to https://phabricator.wikimedia.org/P59317 and previous config saved to /var/cache/conftool/dbconfig/20240403-114327-arnaudb.json
[11:43:30] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:43:32] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[11:43:43] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:43:51] <moritzm>	 !log installing imagemagick security updates
[11:43:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T360332)', diff saved to https://phabricator.wikimedia.org/P59318 and previous config saved to /var/cache/conftool/dbconfig/20240403-114350-arnaudb.json
[11:43:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T356166)', diff saved to https://phabricator.wikimedia.org/P59319 and previous config saved to /var/cache/conftool/dbconfig/20240403-114502-marostegui.json
[11:45:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[11:45:10] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[11:45:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[11:45:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T356166)', diff saved to https://phabricator.wikimedia.org/P59320 and previous config saved to /var/cache/conftool/dbconfig/20240403-114525-marostegui.json
[11:46:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T360332)', diff saved to https://phabricator.wikimedia.org/P59321 and previous config saved to /var/cache/conftool/dbconfig/20240403-114611-arnaudb.json
[11:47:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T356166)', diff saved to https://phabricator.wikimedia.org/P59322 and previous config saved to /var/cache/conftool/dbconfig/20240403-114743-marostegui.json
[11:47:59] <wikibugs>	 (CR) Slavina Stefanova: "tested on toolsbeta" [puppet] - https://gerrit.wikimedia.org/r/1016724 (https://phabricator.wikimedia.org/T354507) (owner: Slavina Stefanova)
[11:50:39] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1039.eqiad.wmnet with OS bookworm
[11:50:57] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683960 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1039.eqiad.wmnet...
[11:52:47] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[11:52:52] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[11:53:26] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt1039: move to modern single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1016751 (https://phabricator.wikimedia.org/T319184)
[11:54:36] <wikibugs>	 (CR) David Caro: [C:+1] cloudvirt1039: move to modern single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1016751 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:54:49] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] cloudvirt1039: move to modern single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1016751 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:55:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:55:54] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683981 (aborrero)
[11:57:01] <wikibugs>	 (PS2) Filippo Giunchedi: hieradata: add logstash_oidc client [puppet] - https://gerrit.wikimedia.org/r/1016301 (https://phabricator.wikimedia.org/T337818)
[11:57:18] <wikibugs>	 (PS10) Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506)
[11:57:21] <wikibugs>	 (PS4) Filippo Giunchedi: Use oauth2-proxy for opensearch dashboards [puppet] - https://gerrit.wikimedia.org/r/1015045 (https://phabricator.wikimedia.org/T337818)
[11:57:21] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you, I've added the vhost at Ib82d2a93" [puppet] - https://gerrit.wikimedia.org/r/1015045 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[11:58:42] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1039
[11:59:39] <wikibugs>	 (PS11) Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506)
[12:01:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P59323 and previous config saved to /var/cache/conftool/dbconfig/20240403-120118-arnaudb.json
[12:02:15] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[12:02:19] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:02:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P59324 and previous config saved to /var/cache/conftool/dbconfig/20240403-120251-marostegui.json
[12:05:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:07:18] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:07:30] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1039
[12:07:42] <wikibugs>	 (CR) Ladsgroup: [C:+1] WMCS: Read from the new block/block_target tables [puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[12:08:17] <wikibugs>	 (PS1) JMeybohm: k8s: Enable audit logging in staging-eqiad [puppet] - https://gerrit.wikimedia.org/r/1016753 (https://phabricator.wikimedia.org/T273507)
[12:08:45] <wikibugs>	 (CR) CI reject: [V:-1] k8s: Enable audit logging in staging-eqiad [puppet] - https://gerrit.wikimedia.org/r/1016753 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[12:08:58] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage
[12:09:05] <wikibugs>	 (03PS2) 10JMeybohm: k8s: Enable audit logging in staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1016753 (https://phabricator.wikimedia.org/T273507)
[12:10:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:11:40] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage
[12:11:58] <wikibugs>	 (03CR) 10Ayounsi: Netbox: add functions to get and set device name (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi)
[12:14:01] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] puppet: PuppetServer.destroy improvement [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans)
[12:14:34] <wikibugs>	 (03CR) 10Majavah: [C:03+1] puppet: PuppetServer.destroy improvement [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans)
[12:15:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:16:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P59325 and previous config saved to /var/cache/conftool/dbconfig/20240403-121626-arnaudb.json
[12:16:47] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] analytics_cluster::coordinator: Configure Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:17:51] <wikibugs>	 (03CR) 10Volans: [C:03+2] puppet: PuppetServer.destroy improvement [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans)
[12:17:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P59326 and previous config saved to /var/cache/conftool/dbconfig/20240403-121759-marostegui.json
[12:20:08] <wikibugs>	 (03PS1) 10Fabfur: benthos: add BENTHOS_SOURCE envvar [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109)
[12:20:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:26:06] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1016753 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[12:26:07] <wikibugs>	 (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[12:27:02] <wikibugs>	 (03Merged) 10jenkins-bot: puppet: PuppetServer.destroy improvement [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans)
[12:27:08] <wikibugs>	 (03CR) 10Volans: [C:04-1] "I spotted a few minor corner cases to cover." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi)
[12:31:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T360332)', diff saved to https://phabricator.wikimedia.org/P59327 and previous config saved to /var/cache/conftool/dbconfig/20240403-123133-arnaudb.json
[12:31:36] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[12:31:37] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[12:31:49] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[12:31:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T360332)', diff saved to https://phabricator.wikimedia.org/P59328 and previous config saved to /var/cache/conftool/dbconfig/20240403-123156-arnaudb.json
[12:32:19] <hashar>	 I am going to upgrade the CI Jenkins
[12:33:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T356166)', diff saved to https://phabricator.wikimedia.org/P59329 and previous config saved to /var/cache/conftool/dbconfig/20240403-123306-marostegui.json
[12:33:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[12:33:10] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:33:17] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[12:33:22] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[12:33:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T356166)', diff saved to https://phabricator.wikimedia.org/P59330 and previous config saved to /var/cache/conftool/dbconfig/20240403-123329-marostegui.json
[12:34:40] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1039.eqiad.wmnet with OS bookworm
[12:34:50] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684132 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1039.eqiad.wmnet with...
[12:35:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:36:40] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] benthos: add BENTHOS_SOURCE envvar [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[12:36:46] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "Looks a bit hackish - but I trust your thesis on how blubber would copy everything around again potentially. So this LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey)
[12:37:33] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott)
[12:41:53] * hashar !log Upgrading CI Jenkins # T360759
[12:42:14] * Lucas_WMDE confused at !log in /me message
[12:42:20] <hashar>	 OH MY
[12:42:22] <hashar>	 well spotted
[12:42:25] <Lucas_WMDE>	 :D
[12:42:26] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1 C:03+2] k8s: Enable audit logging in staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1016753 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[12:42:29] <hashar>	 !log Upgrading CI Jenkins # T360759
[12:42:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:32] <stashbot>	 T360759: Jenkins core security advisory - 2024-03-20 - https://phabricator.wikimedia.org/T360759
[12:42:49] <hashar>	 well hmm
[12:42:52] <hashar>	 apparently it managed to start
[12:43:13] * hashar claims success
[12:45:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T356166)', diff saved to https://phabricator.wikimedia.org/P59332 and previous config saved to /var/cache/conftool/dbconfig/20240403-124550-marostegui.json
[12:45:54] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[12:47:55] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "LG thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol)
[12:51:23] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Remove profile::pki::client's specific hiera config [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey)
[12:52:12] <wikibugs>	 (03CR) 10Majavah: "No, this is needed for PCC runs for wikiproduction hosts..." [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey)
[12:52:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] analytics_cluster::coordinator: Configure Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:53:14] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] "There is already a value in common.yaml, it should be fine to just use that one, no? I think it is confusing to keep two values.." [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey)
[12:55:05] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol)
[12:55:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T360332)', diff saved to https://phabricator.wikimedia.org/P59333 and previous config saved to /var/cache/conftool/dbconfig/20240403-125521-arnaudb.json
[12:55:28] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[12:55:39] <wikibugs>	 (03CR) 10Majavah: "I don't think namespaced keys are looked up from common.yaml in production, but I might be wrong?" [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey)
[12:55:46] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9684225 (10MoritzMuehlenhoff)
[12:58:35] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] "I was convinced they were, but then I discovered https://phabricator.wikimedia.org/T209265. This task unveils horrible holes in my puppet " [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey)
[12:58:42] <Dreamy_Jazz>	 jouncebot: next
[12:58:42] <jouncebot>	 In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1300)
[12:59:06] <Dreamy_Jazz>	 Considering there are no patches in the window, I want to do a security deploy.
[12:59:13] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics: Capacity planning/estimation for Thanos - https://phabricator.wikimedia.org/T357747#9684238 (10fgiunchedi) Moving off Q4 board since we have hw in capex spreadsheet and it'll be coming next FY
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:18] <Dreamy_Jazz>	 \o
[13:00:21] <Lucas_WMDE>	 Dreamy_Jazz: go ahead
[13:00:23] <Dreamy_Jazz>	 Thanks.
[13:00:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P59334 and previous config saved to /var/cache/conftool/dbconfig/20240403-130058-marostegui.json
[13:02:23] <wikibugs>	 (03PS1) 10Elukey: profile::pki::client: re-introduce fake auth token [labs/private] - 10https://gerrit.wikimedia.org/r/1016764 (https://phabricator.wikimedia.org/T360595)
[13:03:22] <wikibugs>	 (03PS2) 10Fabfur: benthos: add BENTHOS_SOURCE envvar [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109)
[13:03:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] Decommission an-coord100[12] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol)
[13:03:32] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] profile::pki::client: re-introduce fake auth token [labs/private] - 10https://gerrit.wikimedia.org/r/1016764 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey)
[13:04:07] <elukey>	 taavi: ok now I am going to stop messing with deployment-prep I promise, thanks for the patience
[13:05:04] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] benthos: add BENTHOS_SOURCE envvar [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[13:05:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:06:25] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: disable pint promql/series for EnvoyRuntimeAdminOverrides [alerts] - 10https://gerrit.wikimedia.org/r/1016786 (https://phabricator.wikimedia.org/T359633)
[13:06:26] <Dreamy_Jazz>	 I have two security patches to deploy. I will say once I'm done.
[13:10:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P59335 and previous config saved to /var/cache/conftool/dbconfig/20240403-131029-arnaudb.json
[13:12:53] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483)
[13:13:06] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891#9684341 (10fnegri) It's very slow for me as well, I hadn't opened it in a while but it was barely usable both yesterday and today.  ` ~ $ curl -o /...
[13:14:08] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483)
[13:15:38] <moritzm>	 !log installing tiff security updates
[13:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P59336 and previous config saved to /var/cache/conftool/dbconfig/20240403-131606-marostegui.json
[13:16:33] <jinxer-wm>	 (KubernetesCalicoDown) firing: (78) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:16:55] <jinxer-wm>	 (ProbeDown) firing: Service miscweb1003:30443 has failed probes (http_dbtree_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:17:29] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:17:40] <jinxer-wm>	 (CalicoTyphaDown) firing: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown
[13:17:47] * sukhe here for the eventual pag.e I guess!
[13:17:57] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:18:01] <sukhe>	 ha
[13:18:09] <hnowlan>	  here
[13:18:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:18:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:18:20] <sukhe>	 ACKed
[13:18:27] <godog>	 here too, thank you sukhe 
[13:18:45] <godog>	 calico in trouble maybe? 13:16 -jinxer-wm:#wikimedia-operations- (KubernetesCalicoDown) firing: (78)
[13:19:00] <mutante>	 here..ugh
[13:19:31] <hnowlan>	 yeah calico or typha
[13:19:34] <wikibugs>	 (03PS9) 10Elukey: Rework the amd-pytorch22's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638)
[13:20:26] <elukey>	 wow zero typha containers running? 
[13:20:27] <hnowlan>	 calico pods are in crashloopbackoff 
[13:20:29] <hnowlan>	 yeahhhh
[13:20:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (5) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:21:00] <godog>	 not sure what the next best action here is?
[13:21:03] <sukhe>	 sharp increase in 5xx
[13:21:15] <hnowlan>	 bird/confd is not live: Service confd is not running. 
[13:21:33] <sukhe>	 ouch 
[13:21:33] <jinxer-wm>	 (KubernetesCalicoDown) firing: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:21:38] <hnowlan>	 jayme, akosiaris: are you about? 
[13:21:55] <jinxer-wm>	 (ProbeDown) firing: (12) Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:22:02] <taavi>	 here too
[13:22:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (2) p75 latency high: eqiad mw-api-ext (k8s) 21.31s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:22:16] <effie>	 people, let's take this to -sre
[13:22:25] <effie>	 as it looks very very very bad 
[13:22:30] <jinxer-wm>	 (ProbeDown) firing: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:22:53] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:22:56] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[13:22:57] <jinxer-wm>	 (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:23:06] <jinxer-wm>	 (MediaWikiEditFailures) firing: (2) Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[13:23:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:23:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: (5) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:23:20] <sukhe>	 !incidents
[13:23:20] <sirenbot>	 4556 (ACKED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[13:23:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:24:43] <jinxer-wm>	 (VarnishUnavailable) firing: (2) varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[13:24:44] <jinxer-wm>	 (HaproxyUnavailable) firing: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[13:24:49] <sukhe>	 this is a fun one
[13:25:02] <sukhe>	 ACKed all
[13:25:16] <Dreamy_Jazz>	 Currently deploying a security fix but my internet went out.
[13:25:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P59337 and previous config saved to /var/cache/conftool/dbconfig/20240403-132536-arnaudb.json
[13:25:42] <stashbot>	 arnaudb@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[13:25:51] <urbanecm>	 Dreamy_Jazz: not sure how far your backscroll goes, but there is an incident ATM
[13:25:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (11) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:25:56] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[13:26:01] <Dreamy_Jazz>	 Oh. I see.
[13:26:21] <Dreamy_Jazz>	 I'm not sure if my console is actually still connected, so no idea if the security deploy has errored out or is still continuing.
[13:26:24] <TheresNoTime>	 Dreamy_Jazz: what were you deploying?
[13:26:29] <Dreamy_Jazz>	 A security patch
[13:26:30] <Lucas_WMDE>	 I don’t see a running scap, at least
[13:26:33] <jinxer-wm>	 (CalicoKubeControllersDown) firing: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[13:26:33] <jinxer-wm>	 (KubernetesCalicoDown) firing: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:26:53] <Dreamy_Jazz>	 Using deploy_security.py
[13:26:55] <jinxer-wm>	 (ProbeDown) firing: (12) Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:26:56] <Lucas_WMDE>	 nor a login session in `who`
[13:27:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (3) p75 latency high: eqiad mw-api-ext (k8s) 8.301s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:27:20] <TheresNoTime>	 Dreamy_Jazz: task #? I doubt it's related, but..
[13:27:30] <jinxer-wm>	 (ProbeDown) firing: (20) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:27:31] <Lucas_WMDE>	 but we should probably back off for the incident anyway
[13:27:36] <Dreamy_Jazz>	 https://phabricator.wikimedia.org/T361479
[13:27:36] <jinxer-wm>	 (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[13:27:50] <sukhe>	 !incidents
[13:27:50] <sirenbot>	 4556 (ACKED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[13:27:50] <sirenbot>	 4557 (ACKED)  [2x] VarnishUnavailable global sre (varnish-text)
[13:27:50] <sirenbot>	 4558 (ACKED)  [2x] HaproxyUnavailable cache_text global sre ()
[13:27:51] <sirenbot>	 4559 (UNACKED)  GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[13:27:54] <sukhe>	 !ack 4559
[13:27:55] <sirenbot>	 4559 (ACKED)  GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[13:27:57] <jinxer-wm>	 (ProbeDown) firing: (16) Service eventgate-analytics:4592 has failed probes (http_eventgate-analytics_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:28:30] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:28:52] <Dreamy_Jazz>	 My internet went out after I saw messages related to the incident, so I don't think it is related.
[13:28:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:29:40] <wikibugs>	 10ops-codfw, 06SRE, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9684390 (10Jhancock.wm)
[13:29:51] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=restbase.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:30:04] <sukhe>	 !incidents
[13:30:04] <sirenbot>	 4556 (ACKED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[13:30:05] <sirenbot>	 4557 (ACKED)  [2x] VarnishUnavailable global sre (varnish-text)
[13:30:05] <sirenbot>	 4558 (ACKED)  [2x] HaproxyUnavailable cache_text global sre ()
[13:30:05] <sirenbot>	 4559 (ACKED)  GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[13:30:05] <sirenbot>	 4560 (UNACKED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[13:30:08] <sukhe>	 !ack 4560
[13:30:08] <sirenbot>	 4560 (ACKED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[13:30:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (19) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:31:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T356166)', diff saved to https://phabricator.wikimedia.org/P59338 and previous config saved to /var/cache/conftool/dbconfig/20240403-133113-marostegui.json
[13:31:15] <jinxer-wm>	 (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:31:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[13:31:18] <stashbot>	 marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[13:31:19] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[13:31:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[13:31:33] <jinxer-wm>	 (KubernetesCalicoDown) firing: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:31:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[13:31:48] <jinxer-wm>	 (KubernetesCalicoDown) firing: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:31:51] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:31:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[13:31:55] <jinxer-wm>	 (ProbeDown) firing: (12) Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:32:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T356166)', diff saved to https://phabricator.wikimedia.org/P59339 and previous config saved to /var/cache/conftool/dbconfig/20240403-133200-marostegui.json
[13:32:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (4) p75 latency high: eqiad mw-api-ext (k8s) 2.451s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:32:30] <jinxer-wm>	 (ProbeDown) resolved: (30) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:32:36] <jinxer-wm>	 (GatewayBackendErrorsHigh) firing: (3) rest-gateway: elevated 5xx errors from page-analytics_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it  - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[13:32:40] <jinxer-wm>	 (CalicoTyphaDown) resolved: Too few (1) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown
[13:32:46] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9684402 (10Reedy)
[13:32:57] <jinxer-wm>	 (ProbeDown) resolved: (18) Service citoid:4003 has failed probes (http_citoid_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:33:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (5) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 10.33% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:33:30] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (7) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:34:43] <jinxer-wm>	 (VarnishUnavailable) resolved: (2) varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[13:34:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[13:34:51] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: (9) ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:35:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (18) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:35:56] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[13:35:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: ...
[13:36:02] <jinxer-wm>	 Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[13:36:15] <jinxer-wm>	 (MediaWikiMemcachedHighErrorRate) resolved: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:36:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T356166)', diff saved to https://phabricator.wikimedia.org/P59340 and previous config saved to /var/cache/conftool/dbconfig/20240403-133619-marostegui.json
[13:36:22] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[13:36:23] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483) (owner: 10Alexandros Kosiaris)
[13:36:25] <Dreamy_Jazz>	 Is the issue related to Thumbor specifically?
[13:36:28] <hnowlan>	 no
[13:36:33] <jinxer-wm>	 (CalicoKubeControllersDown) resolved: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[13:36:33] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:36:51] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: (5) ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:36:55] <jinxer-wm>	 (ProbeDown) resolved: (12) Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:37:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: (4) p75 latency high: eqiad mw-api-ext (k8s) 1.496s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:37:18] <Dreamy_Jazz>	 Okay. Thanks. The MediaModeration dashboard suggested issues since 7am today
[13:37:21] <Dreamy_Jazz>	 https://grafana.wikimedia.org/d/STSXVVdSk/mediamoderation-photodna-stats?orgId=1&refresh=5m&var-wiki=commonswiki
[13:37:25] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483) (owner: 10Alexandros Kosiaris)
[13:37:36] <jinxer-wm>	 (GatewayBackendErrorsHigh) firing: (3) rest-gateway: elevated 5xx errors from page-analytics_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it  - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[13:37:56] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[13:38:06] <jinxer-wm>	 (MediaWikiEditFailures) resolved: (2) Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[13:38:30] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:38:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:39:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Thanks for this, let us know when you are ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[13:39:51] <jinxer-wm>	 (ATSBackendErrorsHigh) resolved: (9) ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:40:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T360332)', diff saved to https://phabricator.wikimedia.org/P59341 and previous config saved to /var/cache/conftool/dbconfig/20240403-134044-arnaudb.json
[13:40:47] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[13:40:47] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[13:40:49] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[13:40:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (17) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:40:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[13:41:05] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[13:41:29] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[13:41:34] <Dreamy_Jazz>	 It seems my security deploy is half deployed
[13:41:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T360332)', diff saved to https://phabricator.wikimedia.org/P59342 and previous config saved to /var/cache/conftool/dbconfig/20240403-134136-arnaudb.json
[13:41:49] <Dreamy_Jazz>	 The code is applied but the patch file isn't listed in /srv/patches
[13:41:51] <jinxer-wm>	 (ATSBackendErrorsHigh) resolved: (5) ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:43:47] <TheresNoTime>	 I'm not sure if retrying the deploy again will error out or just do the bits which were left — I guess leaving it in its current state is safe enough whilst things are a bit unstable
[13:43:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:44:15] <Lucas_WMDE>	 yeah, IMHO it’s best not to do anything right now until the other incident is resolved
[13:44:47] <jinxer-wm>	 (HelmReleaseBadStatus) firing: (4) Helm release mw-api-ext/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:45:12] <taavi>	 Dreamy_Jazz: yes, please do not deploy right now without coordinating with #-sre
[13:45:23] <Dreamy_Jazz>	 Okay.
[13:45:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (15) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:47:53] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:50:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (9) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:51:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P59343 and previous config saved to /var/cache/conftool/dbconfig/20240403-135126-marostegui.json
[13:55:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1400)
[14:00:26] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:00:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:05:30] <wikibugs>	 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9684511 (10lmata)
[14:06:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw
[14:06:30] <wikibugs>	 10ops-codfw, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9684516 (10lmata)
[14:06:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P59344 and previous config saved to /var/cache/conftool/dbconfig/20240403-140634-marostegui.json
[14:07:32] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 2527
[14:08:08] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 2527
[14:08:46] <wikibugs>	 (03PS1) 10Hnowlan: calico-typha: double memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794
[14:09:22] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:09:29] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:11:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[14:11:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad
[14:15:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:16:46] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1040.eqiad.wmnet with OS bookworm
[14:17:02] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1040.eqiad.wmnet...
[14:17:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1013968 (owner: 10Majavah)
[14:17:36] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1040: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016795 (https://phabricator.wikimedia.org/T319184)
[14:17:40] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "500 would also be ok, but 600 is fine for me as well, we can always revisit later on." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan)
[14:17:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[14:17:56] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:18:53] <wikibugs>	 (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[14:18:55] <wikibugs>	 (03CR) 10Jelto: ""Bug: T361706" could be added to the commit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan)
[14:19:22] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] calico-typha: double memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan)
[14:20:42] <wikibugs>	 (03CR) 10David Caro: [C:03+1] cloudvirt1040: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016795 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[14:20:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:21:01] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] calico-typha: double memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan)
[14:21:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T356166)', diff saved to https://phabricator.wikimedia.org/P59345 and previous config saved to /var/cache/conftool/dbconfig/20240403-142142-marostegui.json
[14:21:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[14:21:46] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[14:21:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudvirt1040: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016795 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[14:21:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[14:22:22] <wikibugs>	 (03Merged) 10jenkins-bot: calico-typha: double memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan)
[14:22:56] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:24:13] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1040
[14:24:37] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1040
[14:24:55] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684602 (10aborrero)
[14:26:03] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:26:33] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:27:02] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:27:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T360332)', diff saved to https://phabricator.wikimedia.org/P59346 and previous config saved to /var/cache/conftool/dbconfig/20240403-142709-arnaudb.json
[14:27:12] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[14:27:27] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:30:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:31:18] <wikibugs>	 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9684629 (10andrea.denisse) a:03andrea.denisse
[14:31:38] <wikibugs>	 (03CR) 10Fabfur: [V:03+1 C:03+2] benthos: add BENTHOS_SOURCE envvar [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[14:31:49] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage
[14:32:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff)
[14:34:37] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage
[14:37:22] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:49] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-internal&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[14:40:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:41:45] <wikibugs>	 (03PS1) 10Elukey: role::builder: add the somebody user's UID [puppet] - 10https://gerrit.wikimedia.org/r/1016798 (https://phabricator.wikimedia.org/T360638)
[14:42:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P59347 and previous config saved to /var/cache/conftool/dbconfig/20240403-144217-arnaudb.json
[14:44:00] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1795/co" [puppet] - 10https://gerrit.wikimedia.org/r/1016798 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey)
[14:44:01] <logmsgbot>	 !log dreamyjazz@deploy1002 Started scap: (no justification provided)
[14:44:58] <Dreamy_Jazz>	 Didn't provide a reason, but this is related to deploying security patch T361479
[14:45:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:46:14] <sukhe>	 I /win 14
[14:46:58] <wikibugs>	 (03PS4) 10Andrew Bogott: cinder backups: move schedule config from a template into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855)
[14:46:58] <wikibugs>	 (03PS5) 10Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855)
[14:46:58] <wikibugs>	 (03PS1) 10Andrew Bogott: role:cinder_backups: include full env scripts [puppet] - 10https://gerrit.wikimedia.org/r/1016799
[14:46:59] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "wmcs-backup: use novaobserver instead of novaadmin" [puppet] - 10https://gerrit.wikimedia.org/r/1016800
[14:50:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:52:33] <wikibugs>	 (03PS2) 10Andrew Bogott: role:cinder_backups: include full env scripts [puppet] - 10https://gerrit.wikimedia.org/r/1016799
[14:52:34] <wikibugs>	 (03PS2) 10Andrew Bogott: Revert "wmcs-backup: use novaobserver instead of novaadmin" [puppet] - 10https://gerrit.wikimedia.org/r/1016800
[14:52:34] <wikibugs>	 (03PS5) 10Andrew Bogott: cinder backups: move schedule config from a template into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855)
[14:52:34] <wikibugs>	 (03PS6) 10Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855)
[14:54:23] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016799 (owner: 10Andrew Bogott)
[14:57:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P59349 and previous config saved to /var/cache/conftool/dbconfig/20240403-145725-arnaudb.json
[14:59:54] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] role:cinder_backups: include full env scripts [puppet] - 10https://gerrit.wikimedia.org/r/1016799 (owner: 10Andrew Bogott)
[15:00:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Revert "wmcs-backup: use novaobserver instead of novaadmin" [puppet] - 10https://gerrit.wikimedia.org/r/1016800 (owner: 10Andrew Bogott)
[15:01:23] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync
[15:01:34] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1040.eqiad.wmnet with OS bookworm
[15:01:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync
[15:01:47] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684744 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1040.eqiad.wmnet with...
[15:02:22] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:49] <mutante>	 !incidents
[15:02:50] <sirenbot>	 4559 (RESOLVED)  GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[15:02:50] <sirenbot>	 4561 (RESOLVED)  [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[15:02:50] <sirenbot>	 4560 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[15:02:50] <sirenbot>	 4558 (RESOLVED)  [2x] HaproxyUnavailable cache_text global sre ()
[15:02:51] <sirenbot>	 4557 (RESOLVED)  [2x] VarnishUnavailable global sre (varnish-text)
[15:02:51] <sirenbot>	 4556 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[15:02:56] <logmsgbot>	 !log dreamyjazz@deploy1002 Finished scap: (no justification provided) (duration: 18m 54s)
[15:03:48] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2098.codfw.wmnet with reason: restart of mysqld
[15:04:02] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2098.codfw.wmnet with reason: restart of mysqld
[15:04:51] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott)
[15:06:17] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] role::builder: add the somebody user's UID [puppet] - 10https://gerrit.wikimedia.org/r/1016798 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey)
[15:06:48] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Rework the amd-pytorch22's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey)
[15:07:36] <jinxer-wm>	 (GatewayBackendErrorsHigh) resolved: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[15:08:25] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 10decommission-hardware, 13Patch-For-Review: 14decommission db2100.codfw.wmnet - 14https://phabricator.wikimedia.org/T361584#9684775 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[15:11:04] <wikibugs>	 (03PS6) 10Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986)
[15:11:48] <wikibugs>	 (03PS6) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614
[15:12:15] <wikibugs>	 (03CR) 10Ayounsi: "Thanks, addressed" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi)
[15:12:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T360332)', diff saved to https://phabricator.wikimedia.org/P59350 and previous config saved to /var/cache/conftool/dbconfig/20240403-151233-arnaudb.json
[15:12:36] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[15:12:37] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[15:12:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[15:12:52] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance
[15:13:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance
[15:13:29] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance
[15:13:42] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance
[15:13:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2114 (T360332)', diff saved to https://phabricator.wikimedia.org/P59351 and previous config saved to /var/cache/conftool/dbconfig/20240403-151349-arnaudb.json
[15:16:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T360332)', diff saved to https://phabricator.wikimedia.org/P59352 and previous config saved to /var/cache/conftool/dbconfig/20240403-151614-arnaudb.json
[15:17:47] <wikibugs>	 (PS7) Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986)
[15:22:15] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:22:35] <Dreamy_Jazz>	 !log Starting MediaModeration scanning script again - It crashed due to the outage
[15:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:55] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:25:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:26:54] <wikibugs>	 (PS6) Andrew Bogott: cinder backups: move schedule config from a template into hiera [puppet] - https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855)
[15:26:54] <wikibugs>	 (PS7) Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855)
[15:26:54] <wikibugs>	 (PS1) Andrew Bogott: wmcs-backup.py: replace image_id with image_info in a few more places [puppet] - https://gerrit.wikimedia.org/r/1016806 (https://phabricator.wikimedia.org/T359192)
[15:27:54] <wikibugs>	 (PS1) Elukey: amd-pytorch22: move comments to a README file [docker-images/production-images] - https://gerrit.wikimedia.org/r/1016807 (https://phabricator.wikimedia.org/T360638)
[15:30:13] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] amd-pytorch22: move comments to a README file [docker-images/production-images] - https://gerrit.wikimedia.org/r/1016807 (https://phabricator.wikimedia.org/T360638) (owner: Elukey)
[15:30:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:30:54] <wikibugs>	 (CR) CI reject: [V:-1] wmcs-backup.py: replace image_id with image_info in a few more places [puppet] - https://gerrit.wikimedia.org/r/1016806 (https://phabricator.wikimedia.org/T359192) (owner: Andrew Bogott)
[15:31:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P59353 and previous config saved to /var/cache/conftool/dbconfig/20240403-153121-arnaudb.json
[15:31:31] <wikibugs>	 (PS8) Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986)
[15:32:07] <wikibugs>	 (PS2) Andrew Bogott: wmcs-backup.py: replace image_id with image_info in a few more places [puppet] - https://gerrit.wikimedia.org/r/1016806 (https://phabricator.wikimedia.org/T359192)
[15:32:07] <wikibugs>	 (PS7) Andrew Bogott: cinder backups: move schedule config from a template into hiera [puppet] - https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855)
[15:32:07] <wikibugs>	 (PS8) Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855)
[15:32:49] <jinxer-wm>	 (PuppetDisabled) resolved: Puppet disabled on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-internal&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[15:33:22] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[15:33:25] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[15:35:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:36:26] <wikibugs>	 (CR) Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1016807 (https://phabricator.wikimedia.org/T360638) (owner: Elukey)
[15:37:32] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] amd-pytorch22: move comments to a README file [docker-images/production-images] - https://gerrit.wikimedia.org/r/1016807 (https://phabricator.wikimedia.org/T360638) (owner: Elukey)
[15:42:01] <wikibugs>	 (CR) Tchanders: [C:+1] "Looks good - adding +1 for when the -2 is removed" [mediawiki-config] - https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) (owner: Dreamy Jazz)
[15:42:44] <wikibugs>	 (CR) Volans: [C:-1] "Sorry last minute bug spotted, not your fault" [software/spicerack] - https://gerrit.wikimedia.org/r/1007614 (owner: Ayounsi)
[15:44:06] <wikibugs>	 (CR) JHathaway: [C:+1] wmcs puppetservers: stop pulling hiera from /etc/puppet/secrets [puppet] - https://gerrit.wikimedia.org/r/1015392 (owner: Andrew Bogott)
[15:44:25] <wikibugs>	 (CR) FNegri: [C:+1] "LGTM, sorry for not spotting this side effect of my change!" [puppet] - https://gerrit.wikimedia.org/r/1016806 (https://phabricator.wikimedia.org/T359192) (owner: Andrew Bogott)
[15:45:43] <wikibugs>	 (CR) JHathaway: "that looks right, do you folks have a cfssl server in fund raising tech, or can you reach out to ours?" [puppet] - https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[15:45:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:46:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P59354 and previous config saved to /var/cache/conftool/dbconfig/20240403-154628-arnaudb.json
[15:48:39] <effie>	 !depool mw-web-ro in eqiad
[15:48:39] <wm-bot>	 for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done
[15:53:42] <logmsgbot>	 !log jiji@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-web-ro,name=eqiad
[16:01:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T360332)', diff saved to https://phabricator.wikimedia.org/P59355 and previous config saved to /var/cache/conftool/dbconfig/20240403-160136-arnaudb.json
[16:01:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[16:01:50] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[16:01:52] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[16:02:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T360332)', diff saved to https://phabricator.wikimedia.org/P59356 and previous config saved to /var/cache/conftool/dbconfig/20240403-160159-arnaudb.json
[16:04:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T360332)', diff saved to https://phabricator.wikimedia.org/P59357 and previous config saved to /var/cache/conftool/dbconfig/20240403-160425-arnaudb.json
[16:05:11] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[16:05:25] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[16:07:18] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:07:41] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[16:08:16] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[16:09:41] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) (owner: Andrew Bogott)
[16:12:57] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[16:14:44] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[16:14:47] <jinxer-wm>	 (HelmReleaseBadStatus) firing: (4) Helm release mw-api-ext/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:16:05] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cinder backups: move schedule config from a template into hiera [puppet] - https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855) (owner: Andrew Bogott)
[16:16:10] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) (owner: Andrew Bogott)
[16:19:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P59358 and previous config saved to /var/cache/conftool/dbconfig/20240403-161933-arnaudb.json
[16:19:47] <jinxer-wm>	 (HelmReleaseBadStatus) firing: (4) Helm release mw-api-ext/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:24:47] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: (4) Helm release mw-api-ext/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:26:02] <logmsgbot>	 !log jiji@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-web-ro,name=eqiad
[16:26:07] <logmsgbot>	 !log jayme@deploy1002 Started scap: (no justification provided)
[16:26:22] <effie>	 !log pooling back mw-web-ro in eqiad
[16:26:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[16:29:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[16:29:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:29:42] <logmsgbot>	 !log jayme@deploy1002 Finished scap: (no justification provided) (duration: 03m 34s)
[16:29:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:30:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T356166)', diff saved to https://phabricator.wikimedia.org/P59359 and previous config saved to /var/cache/conftool/dbconfig/20240403-163004-marostegui.json
[16:30:09] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[16:30:30] <wikibugs>	 (CR) Dwisehaupt: [V:+1] "We do not have a cfssl server in our area. However, this community-crm host will live on a prod vps host (cloudvps for the testing host). " [puppet] - https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[16:30:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:32:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59360 and previous config saved to /var/cache/conftool/dbconfig/20240403-163249-root.json
[16:34:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P59361 and previous config saved to /var/cache/conftool/dbconfig/20240403-163440-arnaudb.json
[16:35:36] <wikibugs>	 (CR) Dzahn: "are you not going to use envoy to do the TLS termination and keep apache on http? that's now the pattern that prod services use when they " [puppet] - https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[16:36:19] <wikibugs>	 (CR) Dzahn: "if that was the case you would have something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014605/3/hieradata/role/common/mi" [puppet] - https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[16:36:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling restart_daemons on A:maps-master
[16:37:57] <wikibugs>	 (CR) BryanDavis: "For Striker's Docker deployment on the cloudweb* hosts we use the `service::docker` wrapper with `host_network => true` so that the code i" [puppet] - https://gerrit.wikimedia.org/r/1016480 (owner: Krinkle)
[16:38:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot-master (exit_code=0) rolling restart_daemons on A:maps-master
[16:41:04] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: provision and commission logging-hd200[123] nodes [puppet] - https://gerrit.wikimedia.org/r/1016368 (https://phabricator.wikimedia.org/T352517) (owner: Cwhite)
[16:42:22] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:45:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:47:22] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:47:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59362 and previous config saved to /var/cache/conftool/dbconfig/20240403-164754-root.json
[16:49:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T360332)', diff saved to https://phabricator.wikimedia.org/P59363 and previous config saved to /var/cache/conftool/dbconfig/20240403-164948-arnaudb.json
[16:49:51] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[16:49:51] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[16:50:04] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[16:50:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T360332)', diff saved to https://phabricator.wikimedia.org/P59364 and previous config saved to /var/cache/conftool/dbconfig/20240403-165011-arnaudb.json
[16:52:22] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:52:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T360332)', diff saved to https://phabricator.wikimedia.org/P59365 and previous config saved to /var/cache/conftool/dbconfig/20240403-165234-arnaudb.json
[16:52:45] <wikibugs>	 (PS1) Volans: tests: fix typos in tests [software/spicerack] - https://gerrit.wikimedia.org/r/1016814
[16:54:36] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:54:43] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:56:25] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:47] <Dreamy_Jazz>	 jouncebot: next
[16:56:47] <jouncebot>	 In 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1700)
[16:56:56] <Dreamy_Jazz>	 jouncebot: nowandnext
[16:56:56] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 3 minute(s)
[16:56:56] <jouncebot>	 In 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1700)
[16:59:13] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:59:21] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1700)
[17:00:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:01:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:03:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59366 and previous config saved to /var/cache/conftool/dbconfig/20240403-170300-root.json
[17:03:26] <wikibugs>	 (CR) Brouberol: [C:+2] "Done" [puppet] - https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: Brouberol)
[17:03:42] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:04:43] <herron>	 !log performing rolling memory upgrades on prometheus100[56] T360687
[17:04:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:46] <stashbot>	 T360687: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687
[17:05:40] <cwhite>	 as a result of ^^ expect to see gaps on dashboards
[17:07:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P59367 and previous config saved to /var/cache/conftool/dbconfig/20240403-170741-arnaudb.json
[17:08:42] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:10:29] <wikibugs>	 (CR) Krinkle: "Aye, so I did consider that in PS1, but I noticed it also affects the ports being exported. There is no longer port mapping in that case, " [puppet] - https://gerrit.wikimedia.org/r/1016480 (owner: Krinkle)
[17:10:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:15:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:16:18] <wikibugs>	 (PS1) Jforrester: Centralize API calls in api.js mixin and fix error handling [extensions/WikiLambda] (wmf/1.42.0-wmf.25) - https://gerrit.wikimedia.org/r/1016778 (https://phabricator.wikimedia.org/T361598)
[17:17:22] <wikibugs>	 (CR) Dzahn: [C:+2] logstash_checker.py: Fix _mwdeploy_query for k8s-less realm [puppet] - https://gerrit.wikimedia.org/r/1016436 (owner: Ahmon Dancy)
[17:18:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59368 and previous config saved to /var/cache/conftool/dbconfig/20240403-171806-root.json
[17:19:40] <wikibugs>	 (CR) Dzahn: "@Urbanecm Could we reboot the stewards machines any time or is something running we should look for?" [puppet] - https://gerrit.wikimedia.org/r/1013649 (owner: Dzahn)
[17:22:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P59369 and previous config saved to /var/cache/conftool/dbconfig/20240403-172249-arnaudb.json
[17:24:27] <wikibugs>	 ops-eqiad, SRE, Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9685401 (VRiley-WMF) worked with @herron and added the 32Gig DDR4 2666 to the requested slots. Both servers came back up and reported the correct sizes as expected. Closing...
[17:24:37] <wikibugs>	 ops-eqiad, SRE, Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9685402 (VRiley-WMF) Open→Resolved
[17:25:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:27:28] <wikibugs>	 (CR) Urbanecm: "Absolutely, reboot is okay at any time." [puppet] - https://gerrit.wikimedia.org/r/1013649 (owner: Dzahn)
[17:30:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:33:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59370 and previous config saved to /var/cache/conftool/dbconfig/20240403-173312-root.json
[17:35:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:37:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T360332)', diff saved to https://phabricator.wikimedia.org/P59371 and previous config saved to /var/cache/conftool/dbconfig/20240403-173756-arnaudb.json
[17:38:00] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[17:38:00] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[17:38:13] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[17:38:15] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[17:38:28] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[17:38:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T360332)', diff saved to https://phabricator.wikimedia.org/P59372 and previous config saved to /var/cache/conftool/dbconfig/20240403-173835-arnaudb.json
[17:39:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T360332)', diff saved to https://phabricator.wikimedia.org/P59373 and previous config saved to /var/cache/conftool/dbconfig/20240403-173958-arnaudb.json
[17:42:50] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685463 (Bethany) This request is approved on my side
[17:45:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:48:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59374 and previous config saved to /var/cache/conftool/dbconfig/20240403-174817-root.json
[17:51:58] <wikibugs>	 (PS7) Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636)
[17:52:34] <mvolz>	 I noticed citoid had some downtime twice today, which is unusual, and I did a deploy this morning :/
[17:52:41] <mvolz>	 also the endpoint is not happy
[17:54:11] <jayme>	 mvolz: hey o/ I did not reach out directly as I thought you were offline, sorry
[17:54:18] <jayme>	 mvolz: just created https://phabricator.wikimedia.org/T361728
[17:54:21] <mvolz>	 I just got back
[17:54:30] <Dreamy_Jazz>	 jouncebot: nowandnext
[17:54:30] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1700)
[17:54:30] <jouncebot>	 In 0 hour(s) and 5 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800)
[17:54:30] <jouncebot>	 In 0 hour(s) and 5 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800)
[17:55:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P59375 and previous config saved to /var/cache/conftool/dbconfig/20240403-175505-arnaudb.json
[17:55:52] <jayme>	 mvolz: the error seems flaky, so maybe it's something more related to load - but I have not gotten around to taking a closer look yet
[17:57:11] <jayme>	 and I gtg unfortunately. If those failures are "real" I'd suggest rolling back for now
[17:57:17] <mvolz>	 jayme: it's a little odd because for citoid itself it was just a package-lock update. It could also be Zotero, but theoretically citoid should function without it, and for that it was just switching to node 18
[17:57:47] <mvolz>	 but yeah I can roll back and see how it goes. might do just citoid to start and watch it for a while?
[17:58:51] <mvolz>	 I can see in grafana there are actually two real downtimes, and the rest of the time it is "flaky".
[17:59:16] <jayme>	 yeah. If it still fails after the rollback it might well be that the actual problem is zotero - maybe there is something useful in the logs as well, I did not check at all tbh
[18:00:04] <jouncebot>	 jnuche and jeena: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800).
[18:00:04] <jouncebot>	 jnuche and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800).
[18:00:04] <wikibugs>	 (CR) Eevans: [C:+2] restbase: remove decommissioned hosts restbase10[19-27] [puppet] - https://gerrit.wikimedia.org/r/1016003 (https://phabricator.wikimedia.org/T354561) (owner: Eevans)
[18:00:26] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[18:00:38] <mvolz>	 Ok, I'm going to roll back just citoid rn
[18:01:01] <jayme>	 mvolz: I gtg. Please feel free to reach out to the US colleagues in #wikimedia-serviceops if you need help
[18:02:03] <wikibugs>	 (PS1) Mvolz: Revert "citoid: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1016781 (https://phabricator.wikimedia.org/T361728)
[18:02:37] <wikibugs>	 (CR) Mvolz: [C:+2] Revert "citoid: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1016781 (https://phabricator.wikimedia.org/T361728) (owner: Mvolz)
[18:03:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59377 and previous config saved to /var/cache/conftool/dbconfig/20240403-180323-root.json
[18:03:31] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016781 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz)
[18:04:51] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[18:04:56] <wikibugs>	 (03CR) 10Scott French: "Many thanks for the review, Riccardo!" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French)
[18:05:13] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[18:05:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:06:06] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:06:41] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[18:07:09] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts restbase[1019-1027].eqiad.wmnet
[18:07:36] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[18:08:10] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[18:09:03] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[18:10:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P59378 and previous config saved to /var/cache/conftool/dbconfig/20240403-181013-arnaudb.json
[18:13:20] <logmsgbot>	 !log dreamyjazz Deployed security patch for T361479
[18:14:21] <wikibugs>	 (03CR) 10Dwisehaupt: [V:03+1] "Thanks! I'll have a look at this and test it out." [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[18:15:44] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685551 (10RLazarus) @AndyRussG Welcome back!  - With the information above, I can set up your LDAP access. For your shell access I'll also need the information on [[ https://phabricator.wikimedia.org/m...
[18:15:50] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685552 (10RLazarus) p:05Triage→03Medium a:03RLazarus
[18:24:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.165s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:25:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T360332)', diff saved to https://phabricator.wikimedia.org/P59379 and previous config saved to /var/cache/conftool/dbconfig/20240403-182520-arnaudb.json
[18:25:23] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance
[18:25:30] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[18:25:36] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance
[18:25:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T360332)', diff saved to https://phabricator.wikimedia.org/P59380 and previous config saved to /var/cache/conftool/dbconfig/20240403-182543-arnaudb.json
[18:28:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T360332)', diff saved to https://phabricator.wikimedia.org/P59381 and previous config saved to /var/cache/conftool/dbconfig/20240403-182806-arnaudb.json
[18:29:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 951.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:30:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:31:06] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:34:03] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.dns.netbox
[18:35:57] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[1019-1027].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[18:36:06] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:37:01] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[1019-1027].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[18:37:01] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:37:02] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase[1019-1027].eqiad.wmnet
[18:40:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:43:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P59382 and previous config saved to /var/cache/conftool/dbconfig/20240403-184313-arnaudb.json
[18:43:19] <wikibugs>	 (03PS1) 10Eevans: site.pp: cleanup restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/1016829 (https://phabricator.wikimedia.org/T354561)
[18:45:33] <wikibugs>	 (03CR) 10Eevans: [C:03+2] site.pp: cleanup restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/1016829 (https://phabricator.wikimedia.org/T354561) (owner: 10Eevans)
[18:49:32] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:49:36] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:50:04] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission restbase10[19-27] - https://phabricator.wikimedia.org/T361372#9685608 (10Eevans)
[18:52:15] <mvolz>	 The citoid rollback doesn't seem to have fixed things, so I'm going to rollback Zotero.
[18:53:05] <mvolz>	 jouncebot: nowandnext
[18:53:05] <jouncebot>	 For the next 0 hour(s) and 6 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800)
[18:53:05] <jouncebot>	 For the next 1 hour(s) and 6 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800)
[18:53:05] <jouncebot>	 In 1 hour(s) and 6 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T2000)
[18:53:32] <mvolz>	 Does anyone care if I do that now?
[18:57:42] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:57:46] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:57:57] <wikibugs>	 (03PS1) 10Mvolz: Revert "Update zotero to node18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016783 (https://phabricator.wikimedia.org/T361728)
[18:58:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P59383 and previous config saved to /var/cache/conftool/dbconfig/20240403-185821-arnaudb.json
[18:58:39] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] Revert "Update zotero to node18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016783 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz)
[18:59:35] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Update zotero to node18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016783 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz)
[19:00:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:01:54] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[19:02:09] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[19:02:32] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[19:03:21] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[19:03:58] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[19:04:31] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[19:05:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:06:09] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:06:13] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:10:51] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:13:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T360332)', diff saved to https://phabricator.wikimedia.org/P59384 and previous config saved to /var/cache/conftool/dbconfig/20240403-191328-arnaudb.json
[19:13:31] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance
[19:13:32] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[19:13:44] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance
[19:13:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59385 and previous config saved to /var/cache/conftool/dbconfig/20240403-191351-arnaudb.json
[19:16:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59386 and previous config saved to /var/cache/conftool/dbconfig/20240403-191615-arnaudb.json
[19:16:33] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:16:38] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:16:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: 14eqiad: (1) VM for MySQL Orchestrator - 14https://phabricator.wikimedia.org/T332718#9685691 (10jhathaway) 05Open→03Declined 14part of bookworm upgrade sprint week, but I ran out of time, not currently prioritizing this work.
[19:18:22] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] stewards: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn)
[19:18:27] <wikibugs>	 (03PS2) 10Dzahn: stewards: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1013649
[19:23:54] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:23:58] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:25:38] <wikibugs>	 (03CR) 10Dzahn: "yea, it would be more in line with the way other services do this. I am happy to show examples to follow and creating certs is much simple" [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[19:27:06] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] stewards: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn)
[19:29:08] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:29:12] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:31:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P59387 and previous config saved to /var/cache/conftool/dbconfig/20240403-193122-arnaudb.json
[19:31:24] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:31:28] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:33:39] <wikibugs>	 (03CR) 10Elukey: "Almost ready to go, let's remove config.yaml and rebase to see if everything looks good." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos)
[19:35:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742 (10AndyRussG) 03NEW
[19:38:44] <mutante>	 !log stewards2001 - reboot to switch from iptables to nftables
[19:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742#9685860 (10AndyRussG)
[19:39:03] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685861 (10AndyRussG)
[19:45:19] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685872 (10AndyRussG) >>! In T361665#9685550, @RLazarus wrote: > @AndyRussG Welcome back!  Heyyy thanks so much!!!! :) :)   > - With the information above, I can set up your LDAP access. For your shell...
[19:46:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P59388 and previous config saved to /var/cache/conftool/dbconfig/20240403-194630-arnaudb.json
[19:51:55] <wikibugs>	 (03CR) 10Dzahn: [V:03+2] "root@stewards2001:/# nft list table inet base" [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T2000).
[20:00:05] <jouncebot>	 phuedx and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] * James_F waves.
[20:00:24] <mutante>	 !log stewards1001 - rebooting to switch from iptables to nftables
[20:00:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:33] <dmartin-WMF>	 Hi I'm here
[20:01:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59390 and previous config saved to /var/cache/conftool/dbconfig/20240403-200137-arnaudb.json
[20:01:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance
[20:01:47] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[20:01:53] <cjming>	 hi hi
[20:01:54] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance
[20:02:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T360332)', diff saved to https://phabricator.wikimedia.org/P59391 and previous config saved to /var/cache/conftool/dbconfig/20240403-200201-arnaudb.json
[20:02:05] <James_F>	 I can deploy if needed.
[20:02:16] <cjming>	 James_F: i was just going to ask that
[20:02:20] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] "machines rebooted, confirmed with "nft list table inet base" the base rules are there and "lsmod | grep tables" shows after reboot there a" [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn)
[20:02:42] <James_F>	 But verification would be best done by phuedx.
[20:02:44] <James_F>	 Eh.
[20:02:48] <James_F>	 Let's do my one, at least.
[20:03:02] <cjming>	 I think David can verify the config patch
[20:03:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1016778 (https://phabricator.wikimedia.org/T361598) (owner: 10Jforrester)
[20:03:14] <James_F>	 Ack.
[20:03:15] <dmartin-WMF>	 Yes
[20:04:19] <cjming>	 cool - thanks!
[20:04:21] <dmartin-WMF>	 (Except to be honest I'm not sure anymore what the verification step entails)
[20:04:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T360332)', diff saved to https://phabricator.wikimedia.org/P59392 and previous config saved to /var/cache/conftool/dbconfig/20240403-200425-arnaudb.json
[20:06:13] * James_F twiddles thumbs waiting for merge.
[20:06:54] <dmartin-WMF>	 Note that there is currently a Merge conflict on our patch
[20:07:18] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:08:04] <wikibugs>	 (03Merged) 10jenkins-bot: Centralize API calls in api.js mixin and fix error handling [extensions/WikiLambda] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1016778 (https://phabricator.wikimedia.org/T361598) (owner: 10Jforrester)
[20:08:52] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1016778|Centralize API calls in api.js mixin and fix error handling (T361598 T315432)]]
[20:09:01] <stashbot>	 T361598: Adapt front-end to understand new errors after returning HTTP error codes - https://phabricator.wikimedia.org/T361598
[20:09:02] <stashbot>	 T315432: Consolidate all in-Vue API calls into our mixins/api.js file - https://phabricator.wikimedia.org/T315432
[20:10:31] <mutante>	 urbanecm: I think instead of wikidev we can do one better and use the group "stewards-users"
[20:10:38] <mutante>	 uid=13367(urbanecm) gid=500(wikidev) groups=500(wikidev),751(stewards-users)
[20:10:40] <wikibugs>	 (03PS8) 10Jforrester: Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci)
[20:10:40] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:10:44] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:11:18] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1016778|Centralize API calls in api.js mixin and fix error handling (T361598 T315432)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:12:19] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Continuing with sync
[20:12:47] <James_F>	 dmartin-WMF: OK, the API change is going out now, so I'll be able to sling out the metrics config change in ~5 minutes' time.
[20:14:01] <wikibugs>	 (03Abandoned) 10Jforrester: testwikis wikis to 1.42.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016069 (https://phabricator.wikimedia.org/T360157) (owner: 10TrainBranchBot)
[20:15:10] <wikibugs>	 (03CR) 10Jforrester: Set "s3" as the default section name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909763 (owner: 10Aaron Schulz)
[20:15:24] <wikibugs>	 (03PS2) 10Dzahn: stewards: let puppet create /srv/exports [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202)
[20:16:15] <wikibugs>	 (03PS2) 10Jforrester: component: Add SandboxLink to Portuguese Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: 10Ederporto)
[20:16:19] <wikibugs>	 (03CR) 10Jforrester: component: Add SandboxLink to Portuguese Wikiquote (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: 10Ederporto)
[20:16:30] <wikibugs>	 (03CR) 10Dzahn: "Amended! But I think we can do better than keep using the old wikidev "hack" and use the proper group "stewards-users" that we already hav" [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[20:18:45] <James_F>	 Of course, as soon as I say '5 mins' scap then just stops responding.
[20:19:06] <dmartin-WMF>	 Right
[20:19:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P59393 and previous config saved to /var/cache/conftool/dbconfig/20240403-201933-arnaudb.json
[20:19:46] <James_F>	 Meh, 5 mins just to update the mw-k8s main pods.
[20:21:16] <urbanecm>	 mutante: using that group works for me as well. I suggested wikidev, as that's what we use for the repo with the app itself.
[20:22:30] <urbanecm>	 Using stewards-users might cause problems if a root changes something there, as they'd have to use sudo (and the file might easily end up owned by another group)
[20:23:50] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1016778|Centralize API calls in api.js mixin and fix error handling (T361598 T315432)]] (duration: 14m 58s)
[20:23:54] <stashbot>	 T361598: Adapt front-end to understand new errors after returning HTTP error codes - https://phabricator.wikimedia.org/T361598
[20:23:55] <stashbot>	 T315432: Consolidate all in-Vue API calls into our mixins/api.js file - https://phabricator.wikimedia.org/T315432
[20:24:47] <James_F>	 Finally.
[20:25:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci)
[20:25:49] <wikibugs>	 (03Merged) 10jenkins-bot: Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci)
[20:26:20] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:992223|Update the WikiLambda instrumentation to use core interaction events (T350497)]]
[20:26:29] <stashbot>	 T350497: Update the WikiLambda instrumentation to use core interaction events - https://phabricator.wikimedia.org/T350497
[20:28:52] <logmsgbot>	 !log jforrester@deploy1002 sfaci and jforrester: Backport for [[gerrit:992223|Update the WikiLambda instrumentation to use core interaction events (T350497)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:29:22] <James_F>	 dmartin-WMF: OK, it's live on the debug servers – can you test if it works from your end?
[20:29:58] <dmartin-WMF>	 Sorry, but please remind me what I should do to verify a change of this sort (only involving ext-EventStreamConfig.php)
[20:30:08] <dmartin-WMF>	 You mean to generate an event in our UI?
[20:31:02] <James_F>	 Yes, it seems to not be erroring at least.
[20:31:17] <James_F>	 But how to tell if they're going into the metrics platform?
[20:31:42] <dmartin-WMF>	 I don't know, sorry
[20:32:29] <dmartin-WMF>	 Has the new instruments patch been deployed?  I didn't think so
[20:33:10] <James_F>	 dmartin-WMF: It's on the debug server and holding until we can verify.
[20:34:22] <James_F>	 OK, it seems good enough for me; in the absence of Sam, I'll continue.
[20:34:23] <logmsgbot>	 !log jforrester@deploy1002 sfaci and jforrester: Continuing with sync
[20:34:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P59394 and previous config saved to /var/cache/conftool/dbconfig/20240403-203440-arnaudb.json
[20:34:43] <dmartin-WMF>	 Good; thanks
[20:45:24] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:992223|Update the WikiLambda instrumentation to use core interaction events (T350497)]] (duration: 19m 03s)
[20:45:27] <stashbot>	 T350497: Update the WikiLambda instrumentation to use core interaction events - https://phabricator.wikimedia.org/T350497
[20:45:44] <James_F>	 All right, all done.
[20:46:02] <dmartin-WMF>	 Excellent.  Thanks again James!
[20:47:05] <wikibugs>	 (03PS3) 10Jforrester: component: Add SandboxLink to Portuguese Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: 10Ederporto)
[20:47:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: 10Ederporto)
[20:47:56] <wikibugs>	 (03Merged) 10jenkins-bot: component: Add SandboxLink to Portuguese Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: 10Ederporto)
[20:48:25] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1015649|component: Add SandboxLink to Portuguese Wikiquote (T361447)]]
[20:48:28] <stashbot>	 T361447: Add SandboxLink to ptwikiquote - https://phabricator.wikimedia.org/T361447
[20:49:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T360332)', diff saved to https://phabricator.wikimedia.org/P59395 and previous config saved to /var/cache/conftool/dbconfig/20240403-204949-arnaudb.json
[20:49:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance
[20:49:54] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[20:50:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance
[20:50:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T360332)', diff saved to https://phabricator.wikimedia.org/P59396 and previous config saved to /var/cache/conftool/dbconfig/20240403-205014-arnaudb.json
[20:50:32] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:50:36] <logmsgbot>	 !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:50:53] <logmsgbot>	 !log jforrester@deploy1002 ederporto and jforrester: Backport for [[gerrit:1015649|component: Add SandboxLink to Portuguese Wikiquote (T361447)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:51:57] <logmsgbot>	 !log jforrester@deploy1002 ederporto and jforrester: Continuing with sync
[20:51:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 1 VM request for postfix mta-out - https://phabricator.wikimedia.org/T361750 (10jhathaway) 03NEW
[20:52:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T360332)', diff saved to https://phabricator.wikimedia.org/P59397 and previous config saved to /var/cache/conftool/dbconfig/20240403-205240-arnaudb.json
[20:58:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mta-out - https://phabricator.wikimedia.org/T361750#9686110 (10jhathaway) a:03jhathaway
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T2100)
[21:01:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mta-out - https://phabricator.wikimedia.org/T361750#9686125 (10jhathaway) p:05Triage→03Medium
[21:02:44] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1015649|component: Add SandboxLink to Portuguese Wikiquote (T361447)]] (duration: 14m 18s)
[21:02:47] <stashbot>	 T361447: Add SandboxLink to ptwikiquote - https://phabricator.wikimedia.org/T361447
[21:04:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9686143 (10jhathaway)
[21:05:43] <wikibugs>	 (03PS3) 10Dzahn: stewards: puppetize steward-onboarder config file and paths [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202)
[21:07:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P59398 and previous config saved to /var/cache/conftool/dbconfig/20240403-210747-arnaudb.json
[21:15:28] <wikibugs>	 (03CR) 10Dzahn: stewards: puppetize steward-onboarder config file and paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[21:22:32] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] aphlict: switch envoy cert provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013416 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[21:22:38] <wikibugs>	 (03PS2) 10Dzahn: aphlict: switch envoy cert provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013416 (https://phabricator.wikimedia.org/T360413)
[21:22:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P59399 and previous config saved to /var/cache/conftool/dbconfig/20240403-212255-arnaudb.json
[21:26:20] <wikibugs>	 (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1013416 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn)
[21:37:26] <wikibugs>	 (03PS2) 10Cwhite: spicerack: update logging-eqiad host to logging-hd1001 [puppet] - 10https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517)
[21:38:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T360332)', diff saved to https://phabricator.wikimedia.org/P59400 and previous config saved to /var/cache/conftool/dbconfig/20240403-213802-arnaudb.json
[21:38:05] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance
[21:38:06] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[21:38:18] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance
[21:38:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T360332)', diff saved to https://phabricator.wikimedia.org/P59401 and previous config saved to /var/cache/conftool/dbconfig/20240403-213825-arnaudb.json
[21:38:42] <wikibugs>	 (CR) Cwhite: [C:+2] spicerack: update logging-eqiad host to logging-hd1001 [puppet] - https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517) (owner: Cwhite)
[21:40:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T360332)', diff saved to https://phabricator.wikimedia.org/P59402 and previous config saved to /var/cache/conftool/dbconfig/20240403-214048-arnaudb.json
[21:48:15] <wikibugs>	 (PS2) Dzahn: delete aphlict.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013417 (https://phabricator.wikimedia.org/T360413)
[21:48:30] <wikibugs>	 (CR) Dzahn: [V:+2 C:+2] delete aphlict.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013417 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[21:50:26] <wikibugs>	 (PS2) Dzahn: ssl: delete aphlict.discovery ssl cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013415 (https://phabricator.wikimedia.org/T360413)
[21:51:39] <wikibugs>	 (CR) Dzahn: [C:+2] ssl: delete aphlict.discovery ssl cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013415 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[21:53:16] <wikibugs>	 (PS1) Bking: WIP: remove elasticsearch-curator dependency [software/spicerack] - https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647)
[21:55:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P59403 and previous config saved to /var/cache/conftool/dbconfig/20240403-215555-arnaudb.json
[21:57:41] <wikibugs>	 SRE, collaboration-services, Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9686377 (Dzahn)
[21:57:58] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9686379 (Dzahn)
[21:58:05] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:58:26] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:58:39] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:59:13] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:59:26] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:59:35] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9686403 (bking)
[21:59:50] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:00:05] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:00:06] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:00:28] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:00:41] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:05:31] <wikibugs>	 SRE, collaboration-services, Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9686416 (Dzahn) @eoghan I have continued with aphlict because I already had the patches uploaded anyways. But Phabricator is left if you still wanted to re-s...
[22:05:59] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:06:11] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:06:13] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:06:34] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:06:34] <wikibugs>	 (PS1) Ebernhardson: cirrus: Increase taskmanager parallelism and reduce batch size [deployment-charts] - https://gerrit.wikimedia.org/r/1016858
[22:06:35] <wikibugs>	 (PS1) Ebernhardson: cirrus: Report container log output on backfilling failure [deployment-charts] - https://gerrit.wikimedia.org/r/1016859
[22:06:36] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:06:44] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:09:29] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:09:38] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:09:46] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:11:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P59404 and previous config saved to /var/cache/conftool/dbconfig/20240403-221103-arnaudb.json
[22:19:56] <wikibugs>	 (PS2) Ebernhardson: cirrus: Tune resource usage of consumer [deployment-charts] - https://gerrit.wikimedia.org/r/1016858
[22:20:42] <wikibugs>	 (PS1) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1016861 (https://phabricator.wikimedia.org/T356933)
[22:22:39] <wikibugs>	 (PS3) Ebernhardson: cirrus: Tune resource usage of consumer [deployment-charts] - https://gerrit.wikimedia.org/r/1016858
[22:26:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T360332)', diff saved to https://phabricator.wikimedia.org/P59405 and previous config saved to /var/cache/conftool/dbconfig/20240403-222610-arnaudb.json
[22:26:15] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[22:26:31] <wikibugs>	 (PS2) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1016861 (https://phabricator.wikimedia.org/T356933)
[22:26:46] <wikibugs>	 (CR) Peter Fischer: [C:+2] Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1016861 (https://phabricator.wikimedia.org/T356933) (owner: Peter Fischer)
[22:27:38] <wikibugs>	 (Merged) jenkins-bot: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1016861 (https://phabricator.wikimedia.org/T356933) (owner: Peter Fischer)
[22:34:13] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:34:46] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:34:50] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:35:09] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:45:32] <wikibugs>	 SRE, MediaWiki-General, MediaWiki-libs-Stats, observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685#9686467 (colewhite)
[23:01:13] <wikibugs>	 (PS1) Scott French: Improve etcdmirror shutdown behavior [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762)
[23:06:02] <wikibugs>	 (CR) Tim Starling: [C:+2] WMCS: Read from the new block/block_target tables [puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[23:19:49] <wikibugs>	 (PS2) Scott French: Improve etcdmirror shutdown behavior [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762)
[23:21:53] <wikibugs>	 (CR) Scott French: "I ran into this while testing out the migration for T358636. It's a fairly simple fix and would make the procedure a bit less stressful :)" [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762) (owner: Scott French)
[23:22:57] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:27:57] <jinxer-wm>	 (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:33:21] <TimStarling>	 !log on clouddb1021 ran maintain-views for enwiki
[23:33:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:16] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1016376
[23:38:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1016376 (owner: TrainBranchBot)
[23:44:59] <wikibugs>	 (PS4) Krinkle: codesearch: Enable network=host and set CODESEARCH_HOUND_BASE [puppet] - https://gerrit.wikimedia.org/r/1016480