[00:26:22] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[1028-1033].eqiad.wmnet: Apply updated JVM — T356648 - eevans@cumin1002
[00:26:26] <stashbot>	 T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648
[00:29:07] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2021-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002
[00:38:35] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/997500
[00:38:41] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/997500 (owner: TrainBranchBot)
[00:41:12] <wikibugs>	 (PS1) Eevans: sessionstore: remove EOL hosts [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405)
[00:42:32] <wikibugs>	 (CR) CI reject: [V: -1] sessionstore: remove EOL hosts [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: Eevans)
[00:45:06] <wikibugs>	 (PS2) Eevans: sessionstore: remove EOL hosts [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405)
[01:04:29] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/997500 (owner: TrainBranchBot)
[01:08:49] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T356726 (phaultfinder)
[01:18:52] <logmsgbot>	 !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase[2021-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002
[01:18:57] <stashbot>	 T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648
[01:19:49] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2026-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002
[01:28:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Allow pdns to query designate-mdns on private interfaces [puppet] - https://gerrit.wikimedia.org/r/997597 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott)
[01:36:17] <wikibugs>	 (PS1) Jdlrobson: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728)
[01:49:19] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[01:55:23] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[01:57:09] <wikibugs>	 (CR) CI reject: [V: -1] Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson)
[02:00:23] <logmsgbot>	 !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase[2026-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002
[02:00:36] <stashbot>	 T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648
[02:03:19] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2030-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002
[02:34:53] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[02:39:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:57] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (Jhancock.wm)
[02:50:01] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0300)
[03:00:24] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[2030-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002
[03:00:39] <stashbot>	 T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648
[03:07:39] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.17 [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997501 (https://phabricator.wikimedia.org/T354435)
[03:07:45] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.17 [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997501 (https://phabricator.wikimedia.org/T354435) (owner: TrainBranchBot)
[03:09:33] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:26:38] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.17 [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997501 (https://phabricator.wikimedia.org/T354435) (owner: TrainBranchBot)
[04:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0400)
[04:02:12] <logmsgbot>	 !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.14 (duration: 02m 07s)
[04:03:29] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/997631 (https://phabricator.wikimedia.org/T354435)
[04:03:31] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/997631 (https://phabricator.wikimedia.org/T354435) (owner: TrainBranchBot)
[04:04:15] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/997631 (https://phabricator.wikimedia.org/T354435) (owner: TrainBranchBot)
[04:04:45] <logmsgbot>	 !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.17  refs T354435
[04:04:52] <stashbot>	 T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435
[04:10:15] <icinga-wm>	 PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:14:47] <icinga-wm>	 RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:55:47] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.17  refs T354435 (duration: 51m 02s)
[04:55:51] <stashbot>	 T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435
[05:11:46] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:58:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1029 T351916', diff saved to https://phabricator.wikimedia.org/P56283 and previous config saved to /var/cache/conftool/dbconfig/20240206-055835-root.json
[05:58:40] <stashbot>	 T351916: Migrate es1 to Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351916
[05:59:13] <wikibugs>	 (PS1) Marostegui: es1029: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/997636 (https://phabricator.wikimedia.org/T351916)
[06:00:39] <wikibugs>	 (CR) Marostegui: [C: +2] es1029: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/997636 (https://phabricator.wikimedia.org/T351916) (owner: Marostegui)
[06:01:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1029.eqiad.wmnet with OS bookworm
[06:02:29] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[06:02:33] <icinga-wm>	 PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:02:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[06:05:37] <icinga-wm>	 RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:06:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:06:48] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:09:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186', diff saved to https://phabricator.wikimedia.org/P56284 and previous config saved to /var/cache/conftool/dbconfig/20240206-060942-root.json
[06:10:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[06:10:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[06:10:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[06:11:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[06:11:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T355609)', diff saved to https://phabricator.wikimedia.org/P56285 and previous config saved to /var/cache/conftool/dbconfig/20240206-061116-marostegui.json
[06:11:20] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:17:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T355609)', diff saved to https://phabricator.wikimedia.org/P56286 and previous config saved to /var/cache/conftool/dbconfig/20240206-061709-marostegui.json
[06:23:53] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db1186 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 905.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:27:37] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on db1186 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1094.52 seconds Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:32:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P56287 and previous config saved to /var/cache/conftool/dbconfig/20240206-063215-marostegui.json
[06:37:47] <logmsgbot>	 !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1029.eqiad.wmnet with OS bookworm
[06:38:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1029.eqiad.wmnet with OS bullseye
[06:47:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P56288 and previous config saved to /var/cache/conftool/dbconfig/20240206-064722-marostegui.json
[06:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, Amir1, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0700).
[07:02:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T355609)', diff saved to https://phabricator.wikimedia.org/P56289 and previous config saved to /var/cache/conftool/dbconfig/20240206-070228-marostegui.json
[07:02:31] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[07:02:39] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:02:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[07:02:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T355609)', diff saved to https://phabricator.wikimedia.org/P56290 and previous config saved to /var/cache/conftool/dbconfig/20240206-070251-marostegui.json
[07:07:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T355609)', diff saved to https://phabricator.wikimedia.org/P56291 and previous config saved to /var/cache/conftool/dbconfig/20240206-070708-marostegui.json
[07:22:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P56292 and previous config saved to /var/cache/conftool/dbconfig/20240206-072215-marostegui.json
[07:25:15] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1029.eqiad.wmnet with OS bullseye
[07:31:29] <wikibugs>	 (PS2) Hoo man: Add wgVirtualDomainsMapping for Cognate [mediawiki-config] - https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526)
[07:37:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P56293 and previous config saved to /var/cache/conftool/dbconfig/20240206-073721-marostegui.json
[07:37:31] <wikibugs>	 (CR) Hoo man: Add wgVirtualDomainsMapping for Cognate (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526) (owner: Hoo man)
[07:52:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T355609)', diff saved to https://phabricator.wikimedia.org/P56294 and previous config saved to /var/cache/conftool/dbconfig/20240206-075228-marostegui.json
[07:52:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[07:52:33] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:52:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[07:52:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T355609)', diff saved to https://phabricator.wikimedia.org/P56295 and previous config saved to /var/cache/conftool/dbconfig/20240206-075251-marostegui.json
[07:56:39] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[07:56:54] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[07:57:14] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[07:57:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T355609)', diff saved to https://phabricator.wikimedia.org/P56296 and previous config saved to /var/cache/conftool/dbconfig/20240206-075733-marostegui.json
[07:57:37] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:57:47] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0800).
[08:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:27] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[08:00:54] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[08:06:42] <logmsgbot>	 !log hoo@deploy2002 backport Cancelled
[08:07:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by hoo@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526) (owner: Hoo man)
[08:07:04] <wikibugs>	 (Abandoned) Filippo Giunchedi: Deprecate nrpe::monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/924901 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[08:07:42] <wikibugs>	 (Merged) jenkins-bot: Add wgVirtualDomainsMapping for Cognate [mediawiki-config] - https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526) (owner: Hoo man)
[08:08:46] <logmsgbot>	 !log hoo@deploy2002 Started scap: Backport for [[gerrit:994922|Add wgVirtualDomainsMapping for Cognate (T348526)]]
[08:08:50] <stashbot>	 T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526
[08:09:54] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:10:35] <logmsgbot>	 !log hoo@deploy2002 hoo: Backport for [[gerrit:994922|Add wgVirtualDomainsMapping for Cognate (T348526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:10:48] <wikibugs>	 (CR) Filippo Giunchedi: "I'm +1 on the idea, not voting yet pending https://gerrit.wikimedia.org/r/c/operations/alerts/+/997253?usp=dashboard" [puppet] - https://gerrit.wikimedia.org/r/994735 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:11:13] <wikibugs>	 (PS2) Arnaudb: mariadb: will test converting instances [puppet] - https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674)
[08:11:16] <logmsgbot>	 !log hoo@deploy2002 hoo: Continuing with sync
[08:11:22] <wikibugs>	 (CR) Arnaudb: mariadb: will test converting instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:12:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P56297 and previous config saved to /var/cache/conftool/dbconfig/20240206-081239-marostegui.json
[08:14:39] <wikibugs>	 (CR) Filippo Giunchedi: "I believe this can be abandoned at this point ?" [puppet] - https://gerrit.wikimedia.org/r/990166 (https://phabricator.wikimedia.org/T354904) (owner: Cwhite)
[08:16:33] <wikibugs>	 (CR) Muehlenhoff: systemd::unit: clean up ownership file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) (owner: Scott French)
[08:17:38] <logmsgbot>	 !log hoo@deploy2002 Finished scap: Backport for [[gerrit:994922|Add wgVirtualDomainsMapping for Cognate (T348526)]] (duration: 08m 51s)
[08:17:42] <stashbot>	 T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526
[08:20:39] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove now obsolete scap config for debmonitor [puppet] - https://gerrit.wikimedia.org/r/997483 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[08:21:02] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db1186 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:22:29] <Kizule>	 Hi, is deployment window still running?
[08:23:25] <Kizule>	 *UTC morning backport
[08:27:12] <Kizule>	 Amir1, urbanecm?
[08:27:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P56298 and previous config saved to /var/cache/conftool/dbconfig/20240206-082746-marostegui.json
[08:27:48] <urbanecm>	 there was no patch in it AFAICS
[08:28:20] <Kizule>	 No, namespaceDupes has to be run on srwiki, and I'm hoping that we can finally do it.
[08:28:51] <Kizule>	 I didn't want to add it to the calendar because I thought the UTC backport window was "marked as finished".
[08:28:58] <Kizule>	 There is a task. https://phabricator.wikimedia.org/T350431
[08:30:11] <Kizule>	 This patch is live so I'm guessing that we won't have surprises. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/995242
[08:32:37] <moritzm>	 !log pruning unneeded openjdk-17-jre-headless packages on ml-cache* hosts
[08:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:25] <wikibugs>	 (PS18) Brouberol: Add a deployment chart for Superset [deployment-charts] - https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: Btullis)
[08:41:18] <wikibugs>	 (PS2) Slyngshede: D:prometheus::blackbox::check::tcp allow specifying runbook. [puppet] - https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694)
[08:41:40] <wikibugs>	 (CR) Slyngshede: D:prometheus::blackbox::check::tcp allow specifying runbook. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:42:06] <wikibugs>	 (CR) Slyngshede: [C: +2] SystemdUnitFailed: Increase the severity of a failed unit to critical. [alerts] - https://gerrit.wikimedia.org/r/997253 (owner: Slyngshede)
[08:42:39] <slyngs>	 !log Increase severity of failed systemd units when alerting from AlertManager
[08:42:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T355609)', diff saved to https://phabricator.wikimedia.org/P56299 and previous config saved to /var/cache/conftool/dbconfig/20240206-084253-marostegui.json
[08:42:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[08:42:57] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:43:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[08:43:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T355609)', diff saved to https://phabricator.wikimedia.org/P56300 and previous config saved to /var/cache/conftool/dbconfig/20240206-084315-marostegui.json
[08:43:43] <wikibugs>	 (Merged) jenkins-bot: SystemdUnitFailed: Increase the severity of a failed unit to critical. [alerts] - https://gerrit.wikimedia.org/r/997253 (owner: Slyngshede)
[08:46:24] <wikibugs>	 (PS10) Brouberol: Add helmfile deployments for Superset [deployment-charts] - https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: Btullis)
[08:47:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:34] <moritzm>	 !log pruning unneeded openjdk-17-jre-headless packages on aqs* hosts
[08:47:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T355609)', diff saved to https://phabricator.wikimedia.org/P56301 and previous config saved to /var/cache/conftool/dbconfig/20240206-084858-marostegui.json
[08:49:03] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:50:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet
[08:52:08] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: prometheus-phpfpm-statustext-textfile.service Failed on mw2279:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:36] <wikibugs>	 (CR) Vgutierrez: fifo-log-demux: Decouple service from nginx/ats (2 comments) [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[09:01:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 1%: After schema change', diff saved to https://phabricator.wikimedia.org/P56302 and previous config saved to /var/cache/conftool/dbconfig/20240206-090102-root.json
[09:01:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "nit inline, rest LGTM" [puppet] - https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:01:54] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host build2001.codfw.wmnet
[09:02:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:03:42] <wikibugs>	 (PS1) Filippo Giunchedi: jaeger: route jaeger-query to oauth2-proxy port [deployment-charts] - https://gerrit.wikimedia.org/r/997789 (https://phabricator.wikimedia.org/T320555)
[09:04:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P56303 and previous config saved to /var/cache/conftool/dbconfig/20240206-090405-marostegui.json
[09:07:48] <wikibugs>	 (PS1) Arnaudb: mariadb: add db1235 to production [puppet] - https://gerrit.wikimedia.org/r/997503 (https://phabricator.wikimedia.org/T344036)
[09:08:16] <wikibugs>	 (PS3) Slyngshede: D:prometheus::blackbox::check::tcp allow specifying runbook. [puppet] - https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694)
[09:12:16] <wikibugs>	 (PS1) Arnaudb: mariadb: toggle notifications for db1235 [puppet] - https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036)
[09:12:42] <wikibugs>	 (Abandoned) Arnaudb: mariadb: add db1235 to production [puppet] - https://gerrit.wikimedia.org/r/997503 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[09:13:43] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete scap config for Netbox/Homer [puppet] - https://gerrit.wikimedia.org/r/997790
[09:16:00] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thanks for the cleanup" [puppet] - https://gerrit.wikimedia.org/r/997790 (owner: Muehlenhoff)
[09:16:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P56304 and previous config saved to /var/cache/conftool/dbconfig/20240206-091607-root.json
[09:18:09] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete scap config for Netbox/Homer [puppet] - https://gerrit.wikimedia.org/r/997790 (owner: Muehlenhoff)
[09:19:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P56305 and previous config saved to /var/cache/conftool/dbconfig/20240206-091911-marostegui.json
[09:20:51] <wikibugs>	 (03CR) 10Marostegui: "Green on icinga?" [puppet] - 10https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[09:21:09] <wikibugs>	 (03CR) 10Arnaudb: "yep!" [puppet] - 10https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[09:21:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: toggle notifications for db1235 [puppet] - 10https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[09:21:29] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notifications for db1235 [puppet] - 10https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[09:22:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 10%: 5', diff saved to https://phabricator.wikimedia.org/P56306 and previous config saved to /var/cache/conftool/dbconfig/20240206-092257-arnaudb.json
[09:25:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/997484 (https://phabricator.wikimedia.org/T356409) (owner: 10Slyngshede)
[09:26:22] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:26:22] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:30:03] <wikibugs>	 (03PS2) 10Filippo Giunchedi: jaeger: route trace.w.o to jaeger-query [deployment-charts] - 10https://gerrit.wikimedia.org/r/997789 (https://phabricator.wikimedia.org/T320555)
[09:30:25] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Make it clear what password is being reset [software/bitu] - 10https://gerrit.wikimedia.org/r/997484 (https://phabricator.wikimedia.org/T356409) (owner: 10Slyngshede)
[09:30:54] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:30:54] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:31:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P56307 and previous config saved to /var/cache/conftool/dbconfig/20240206-093112-root.json
[09:33:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
[09:34:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T355609)', diff saved to https://phabricator.wikimedia.org/P56308 and previous config saved to /var/cache/conftool/dbconfig/20240206-093418-marostegui.json
[09:34:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[09:34:22] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[09:34:33] <wikibugs>	 (03PS1) 10Brouberol: Allow pods in the dse k8s cluster to reach an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/997792 (https://phabricator.wikimedia.org/T356623)
[09:34:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[09:34:35] <wikibugs>	 (03PS1) 10Brouberol: Allow pods in the dse k8s cluster to reach an-druid [puppet] - 10https://gerrit.wikimedia.org/r/997793 (https://phabricator.wikimedia.org/T356623)
[09:34:37] <wikibugs>	 (03PS1) 10Brouberol: Allow pods in the dse k8s cluster to reach public-druid [puppet] - 10https://gerrit.wikimedia.org/r/997794 (https://phabricator.wikimedia.org/T356623)
[09:34:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56309 and previous config saved to /var/cache/conftool/dbconfig/20240206-093440-marostegui.json
[09:37:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
[09:38:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 20%: 5', diff saved to https://phabricator.wikimedia.org/P56310 and previous config saved to /var/cache/conftool/dbconfig/20240206-093803-arnaudb.json
[09:39:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56311 and previous config saved to /var/cache/conftool/dbconfig/20240206-093925-marostegui.json
[09:39:29] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[09:39:52] <wikibugs>	 (03PS11) 10Brouberol: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis)
[09:43:28] <wikibugs>	 (03PS19) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis)
[09:45:04] <wikibugs>	 (03PS20) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis)
[09:46:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P56312 and previous config saved to /var/cache/conftool/dbconfig/20240206-094617-root.json
[09:47:26] <akosiaris>	 !log roll restart all pods in wikikube@eqiad
[09:47:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[09:53:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 30%: 5', diff saved to https://phabricator.wikimedia.org/P56313 and previous config saved to /var/cache/conftool/dbconfig/20240206-095308-arnaudb.json
[09:54:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P56314 and previous config saved to /var/cache/conftool/dbconfig/20240206-095432-marostegui.json
[09:56:32] <moritzm>	 !log installing mariadb-10.5 security/bugfix updates from Bullseye point release (as packaged by Debian, unrelated to wmf-mariadb packages)
[09:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P56315 and previous config saved to /var/cache/conftool/dbconfig/20240206-100123-root.json
[10:01:27] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] wikireplicas: update-views: always run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/989130 (https://phabricator.wikimedia.org/T297026) (owner: 10Majavah)
[10:01:30] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] libraryupgrader: migrate repo to gitlab [puppet] - 10https://gerrit.wikimedia.org/r/997547 (https://phabricator.wikimedia.org/T341417) (owner: 10Majavah)
[10:03:47] <wikibugs>	 (03PS2) 10Majavah: systemd: timer_service: Move ConditionPathExists to correct section [puppet] - 10https://gerrit.wikimedia.org/r/992888
[10:06:10] <wikibugs>	 (03Merged) 10jenkins-bot: wikireplicas: update-views: always run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/989130 (https://phabricator.wikimedia.org/T297026) (owner: 10Majavah)
[10:06:12] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1281/console" [puppet] - 10https://gerrit.wikimedia.org/r/992888 (owner: 10Majavah)
[10:07:06] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] systemd: timer_service: Move ConditionPathExists to correct section [puppet] - 10https://gerrit.wikimedia.org/r/992888 (owner: 10Majavah)
[10:07:58] <wikibugs>	 (03PS1) 10Slyngshede: P:docker::builder clean docker image cache regularly. [puppet] - 10https://gerrit.wikimedia.org/r/997796
[10:08:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 40%: 5', diff saved to https://phabricator.wikimedia.org/P56316 and previous config saved to /var/cache/conftool/dbconfig/20240206-100813-arnaudb.json
[10:08:32] <wikibugs>	 (03CR) 10Slyngshede: D:prometheus::blackbox::check::tcp allow specifying runbook. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[10:09:22] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] D:prometheus::blackbox::check::tcp allow specifying runbook. [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[10:09:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P56317 and previous config saved to /var/cache/conftool/dbconfig/20240206-100938-marostegui.json
[10:16:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P56319 and previous config saved to /var/cache/conftool/dbconfig/20240206-101628-root.json
[10:20:11] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2009.codfw.wmnet
[10:20:21] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1009.eqiad.wmnet
[10:22:19] <akosiaris>	 !log roll restart all pods in wikikube@codfw, wikikube@staging-codfw, wikikube@staging-eqiad
[10:22:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 50%: 5', diff saved to https://phabricator.wikimedia.org/P56320 and previous config saved to /var/cache/conftool/dbconfig/20240206-102318-arnaudb.json
[10:24:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56321 and previous config saved to /var/cache/conftool/dbconfig/20240206-102445-marostegui.json
[10:24:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[10:24:49] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[10:25:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[10:29:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[10:29:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[10:29:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T355609)', diff saved to https://phabricator.wikimedia.org/P56322 and previous config saved to /var/cache/conftool/dbconfig/20240206-102932-marostegui.json
[10:31:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P56323 and previous config saved to /var/cache/conftool/dbconfig/20240206-103133-root.json
[10:33:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T355609)', diff saved to https://phabricator.wikimedia.org/P56324 and previous config saved to /var/cache/conftool/dbconfig/20240206-103341-marostegui.json
[10:33:45] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[10:35:51] <wikibugs>	 (03PS12) 10Brouberol: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis)
[10:38:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 75%: 5', diff saved to https://phabricator.wikimedia.org/P56325 and previous config saved to /var/cache/conftool/dbconfig/20240206-103823-arnaudb.json
[10:40:40] <wikibugs>	 (03PS1) 10Btullis: Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040)
[10:41:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis)
[10:45:20] <wikibugs>	 (03PS1) 10Btullis: Configure reuse-parts for the analytics webserver [puppet] - 10https://gerrit.wikimedia.org/r/997798 (https://phabricator.wikimedia.org/T349398)
[10:45:37] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683]: (no justification provided)
[10:45:59] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683]: (no justification provided) (duration: 00m 22s)
[10:48:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P56326 and previous config saved to /var/cache/conftool/dbconfig/20240206-104848-marostegui.json
[10:49:29] <jinxer-wm>	 (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:49:52] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1009 is CRITICAL: /{src}/info.json (Get service info for osm-intl) is CRITICAL: Test Get service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[10:50:07] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Configure reuse-parts for the analytics webserver [puppet] - 10https://gerrit.wikimedia.org/r/997798 (https://phabricator.wikimedia.org/T349398) (owner: 10Btullis)
[10:53:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 100%: 5', diff saved to https://phabricator.wikimedia.org/P56327 and previous config saved to /var/cache/conftool/dbconfig/20240206-105328-arnaudb.json
[10:57:48] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[10:57:53] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s)
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1100)
[11:02:09] <wikibugs>	 (03PS2) 10Btullis: Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040)
[11:03:00] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-web1001.eqiad.wmnet with OS bullseye
[11:03:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis)
[11:03:49] <volans>	 btullis: FYI there is a possibility that the reimage gets stuck in debian-installer failing to get the proper netmask, we had some failures yesterday and I'm looking at them, not sure yet if it affects all hosts
[11:03:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P56328 and previous config saved to /var/cache/conftool/dbconfig/20240206-110354-marostegui.json
[11:04:01] <volans>	 we did have a successful reimage yesterday too, so not sure
[11:04:12] <volans>	 you can keep an eye on the mgmt console to see the progress
[11:04:20] <Amir1>	 jouncebot: nowandnext
[11:04:20] <jouncebot>	 For the next 0 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1100)
[11:04:20] <jouncebot>	 In 1 hour(s) and 55 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1300)
[11:05:12] <btullis>	 volans: OK, thanks. I'll be on the lookout and report back. It's using reuse-parts-test, so I expect it to wait in the installer at the partman screen, but I'll let you know if it doesn't get that far.
[11:05:41] <volans>	 thx
[11:07:35] <wikibugs>	 (03PS3) 10Btullis: Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040)
[11:10:46] <wikibugs>	 (03PS1) 10Brouberol: superset: configure extra TLS SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/997799 (https://phabricator.wikimedia.org/T356482)
[11:12:26] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[11:12:31] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s)
[11:13:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host stat1010.eqiad.wmnet
[11:13:39] <btullis>	 volans: yes I think it failed with a red screen in the installer, having failed to download the preseed file. I went back and selected 'configure network' again and it had the correct values displayed.
[11:14:33] <btullis>	 I can re-run the cookbook if it would be helpful to you, or I'm happy to continue. It has now downloaded the preseed successfully.
[11:15:01] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host stat1010.eqiad.wmnet
[11:16:08] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:32] <volans>	 btullis: that's super weird, no worries, I've plenty of hosts to play with
[11:17:19] <btullis>	 Ack, I'll probably continue then.
[11:17:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:36] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:19:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T355609)', diff saved to https://phabricator.wikimedia.org/P56329 and previous config saved to /var/cache/conftool/dbconfig/20240206-111901-marostegui.json
[11:19:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[11:19:05] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[11:19:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[11:19:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1244:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56330 and previous config saved to /var/cache/conftool/dbconfig/20240206-111923-marostegui.json
[11:19:54] <wikibugs>	 (03PS2) 10Ladsgroup: Switch the pagelinks default to add read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997420 (https://phabricator.wikimedia.org/T351237)
[11:19:57] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Switch the pagelinks default to add read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997420 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup)
[11:20:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997420 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup)
[11:20:39] <wikibugs>	 (03Merged) 10jenkins-bot: Switch the pagelinks default to add read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997420 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup)
[11:20:59] <Amir1>	 marostegui: FYI, ^ most wikis are going read new on pagelinks
[11:21:04] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:997420|Switch the pagelinks default to add read new (T351237)]]
[11:21:10] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[11:22:15] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[11:22:19] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 04s)
[11:22:37] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:997420|Switch the pagelinks default to add read new (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:25:09] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[11:25:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56331 and previous config saved to /var/cache/conftool/dbconfig/20240206-112514-marostegui.json
[11:25:18] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[11:27:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mw2374:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:28:48] <wikibugs>	 (03CR) 10Majavah: "netbox is using `service::uwsgi` which defaults to `deployment => 'scap3'` (which adds a `scap::target`), does that need updating?" [puppet] - 10https://gerrit.wikimedia.org/r/997790 (owner: 10Muehlenhoff)
[11:30:17] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.hosts.dhcp for host mw1408.eqiad.wmnet
[11:31:43] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:997420|Switch the pagelinks default to add read new (T351237)]] (duration: 10m 38s)
[11:31:47] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[11:32:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (22) prometheus-phpfpm-statustext-textfile.service Failed on mw1353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:34:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix matching block [puppet] - 10https://gerrit.wikimedia.org/r/997800
[11:37:06] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: remove Icinga-based systemd unit failed check [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764)
[11:37:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (36) prometheus-phpfpm-statustext-textfile.service Failed on mw1353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:39:02] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: remove absented statsd hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862)
[11:39:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Doesn't have to happen immediately" [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[11:39:31] <wikibugs>	 (03PS1) 10Volans: installserver: fix typo in preseed [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709)
[11:40:13] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-web1001.eqiad.wmnet with reason: host reimage
[11:40:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3315', diff saved to https://phabricator.wikimedia.org/P56332 and previous config saved to /var/cache/conftool/dbconfig/20240206-114020-marostegui.json
[11:40:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709) (owner: 10Volans)
[11:40:51] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/993742 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene)
[11:42:20] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] hdfs: Assign the right role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/993743 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene)
[11:43:11] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-web1001.eqiad.wmnet with reason: host reimage
[11:43:32] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Fix matching block [puppet] - 10https://gerrit.wikimedia.org/r/997800 (owner: 10Muehlenhoff)
[11:44:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] installserver: fix typo in preseed [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709) (owner: 10Volans)
[11:44:41] <wikibugs>	 (03PS2) 10Volans: installserver: fix typo in preseed [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709)
[11:45:04] <wikibugs>	 (03PS1) 10Filippo Giunchedi: graphite: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831)
[11:45:06] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cache: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831)
[11:46:10] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host mw1408.eqiad.wmnet
[11:49:47] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-web1001.eqiad.wmnet with OS bullseye
[11:49:59] <wikibugs>	 (03CR) 10Volans: [C: 03+2] installserver: fix typo in preseed [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709) (owner: 10Volans)
[11:52:48] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] cache: remove nrpe::monitor_systemd_unit_state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[11:53:51] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] superset: configure extra TLS SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/997799 (https://phabricator.wikimedia.org/T356482) (owner: 10Brouberol)
[11:55:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3315', diff saved to https://phabricator.wikimedia.org/P56334 and previous config saved to /var/cache/conftool/dbconfig/20240206-115527-marostegui.json
[11:58:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1029.eqiad.wmnet with OS bookworm
[11:59:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1164 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:08] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:20] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1164 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:24] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1168 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:41] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[12:05:38] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1168 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:34] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56335 and previous config saved to /var/cache/conftool/dbconfig/20240206-121034-marostegui.json
[12:10:36] <volans>	 ok it seems the reimage issues have been fixed, if you encounter new issues please let us know (context in T356709)
[12:10:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[12:10:39] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[12:10:39] <stashbot>	 T356709: Debian installer waits for input for network config during host reimage - https://phabricator.wikimedia.org/T356709
[12:10:48] <marostegui>	 volans: yeah, my reimage is working fine
[12:10:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[12:10:51] <marostegui>	 Thanks
[12:10:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1029.eqiad.wmnet with reason: host reimage
[12:10:57] <volans>	 nice
[12:11:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff)
[12:12:12] <wikibugs>	 (03CR) 10Btullis: "Looks great. Couple of questions inline, but looks ready to go. At least for this iteration." [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis)
[12:12:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete
[12:13:33] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] P:httpbb: migrate tests from cumin1001 to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French)
[12:13:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1029.eqiad.wmnet with reason: host reimage
[12:14:01] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] P:httpbb: clean up after move from cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/995109 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French)
[12:14:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[12:14:37] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[12:15:44] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:34] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1175 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:15] <wikibugs>	 (03CR) 10Btullis: Add a deployment chart for Superset (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis)
[12:17:20] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1167 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:58] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1160 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:00] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1386.eqiad.wmnet with OS bullseye
[12:18:20] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1173 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:36] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1175 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:40] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[12:19:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1171 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:25] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[12:21:30] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1171 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:32] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1173 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:28] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:22:36] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1167 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:23:28] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:26:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1388.eqiad.wmnet with OS bullseye
[12:27:13] <wikibugs>	 (03PS1) 10Marostegui: Revert "es1029: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/997778
[12:27:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/997506 (https://phabricator.wikimedia.org/T355907) (owner: 10Slyngshede)
[12:28:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1390.eqiad.wmnet with OS bullseye
[12:29:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1029.eqiad.wmnet with OS bookworm
[12:31:11] <wikibugs>	 (03PS1) 10Lucas Werkmeister: Load Filepage.css when previewing File pages [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505)
[12:31:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1386.eqiad.wmnet with reason: host reimage
[12:32:06] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1169 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:11] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:34:25] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1386.eqiad.wmnet with reason: host reimage
[12:34:59] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:36:15] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1169 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:37:39] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[12:39:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1388.eqiad.wmnet with reason: host reimage
[12:39:45] <wikibugs>	 (03PS3) 10Arnaudb: mariadb: will test converting instances [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674)
[12:40:21] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1392.eqiad.wmnet with OS bullseye
[12:40:28] <wikibugs>	 (03PS1) 10Slyngshede: Add gitreview configuration [software/bitu] - 10https://gerrit.wikimedia.org/r/997809 (https://phabricator.wikimedia.org/T355180)
[12:40:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1170 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:41:16] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: will test converting instances [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[12:41:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1394.eqiad.wmnet with OS bullseye
[12:42:08] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1390.eqiad.wmnet with reason: host reimage
[12:42:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1388.eqiad.wmnet with reason: host reimage
[12:42:49] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1169 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[12:44:58] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1390.eqiad.wmnet with reason: host reimage
[12:45:04] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[12:45:04] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1396.eqiad.wmnet with OS bullseye
[12:45:09] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s)
[12:46:30] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1408.eqiad.wmnet with OS bullseye
[12:46:39] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1169 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:40] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: will test converting instances [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[12:48:45] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1160 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:48:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1170 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2317.codfw.wmnet with OS bullseye
[12:50:13] <wikibugs>	 (03PS1) 10Btullis: Fix the reuse-analytics-raid1-2dev partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/997810 (https://phabricator.wikimedia.org/T349398)
[12:50:56] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[12:51:00] <logmsgbot>	 !log jgiannelos@deploy2002 deploy aborted: (no justification provided) (duration: 00m 04s)
[12:51:51] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1386.eqiad.wmnet with OS bullseye
[12:51:55] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "es1029: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/997778 (owner: 10Marostegui)
[12:52:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2318.codfw.wmnet with OS bullseye
[12:52:41] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Fix the reuse-analytics-raid1-2dev partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/997810 (https://phabricator.wikimedia.org/T349398) (owner: 10Btullis)
[12:53:07] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:53:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1392.eqiad.wmnet with reason: host reimage
[12:54:43] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2319.codfw.wmnet with OS bullseye
[12:54:50] <moritzm>	 !log installing openjdk-11 security updates
[12:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:57] <wikibugs>	 (03PS1) 10Slyngshede: Provide context for account creation. [software/bitu] - 10https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584)
[12:55:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1394.eqiad.wmnet with reason: host reimage
[12:56:08] <wikibugs>	 (03CR) 10Slyngshede: "Adding Bryan as a reviewer as well, for input on logo swap." [software/bitu] - 10https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: 10Slyngshede)
[12:56:54] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1392.eqiad.wmnet with reason: host reimage
[12:57:10] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[12:57:15] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s)
[12:58:17] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM vrts1002.eqiad.wmnet
[12:59:00] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1396.eqiad.wmnet with reason: host reimage
[12:59:30] <claime>	 !log Pruning images older than 45 days on build2001: docker image prune -a --filter "until=1080h"/25
[12:59:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:40] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1394.eqiad.wmnet with reason: host reimage
[12:59:48] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1408.eqiad.wmnet with reason: host reimage
[12:59:51] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-web1001.eqiad.wmnet with OS bullseye
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1300)
[13:00:46] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[13:00:47] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 01s)
[13:00:53] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[13:00:54] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 01s)
[13:01:54] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1396.eqiad.wmnet with reason: host reimage
[13:02:25] <claime>	 !log build2001 - Total reclaimed space: 23.31GB
[13:02:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:38] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM vrts1002.eqiad.wmnet
[13:02:53] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2350.codfw.wmnet with OS bullseye
[13:03:23] <claime>	 !log Relaunching build-production-images
[13:03:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1390.eqiad.wmnet with OS bullseye
[13:04:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1408.eqiad.wmnet with reason: host reimage
[13:05:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2352.codfw.wmnet with OS bullseye
[13:06:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1388.eqiad.wmnet with OS bullseye
[13:07:31] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] prometheus-php-fpm-exporter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987440 (owner: 10Clément Goubert)
[13:07:42] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2354.codfw.wmnet with OS bullseye
[13:07:54] <wikibugs>	 (03PS3) 10Volans: P:httpbb: migrate tests from cumin1001 to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French)
[13:07:59] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French)
[13:08:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[13:08:21] <wikibugs>	 (03PS2) 10Filippo Giunchedi: cache: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831)
[13:08:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2317.codfw.wmnet with reason: host reimage
[13:08:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2318.codfw.wmnet with reason: host reimage
[13:09:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] cache: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[13:09:44] <wikibugs>	 (03PS1) 10Btullis: Fix the reuse-analytics-raid1-2dev recipe [puppet] - 10https://gerrit.wikimedia.org/r/997812 (https://phabricator.wikimedia.org/T349398)
[13:10:11] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-web1001.eqiad.wmnet with OS bullseye
[13:10:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw1388.eqiad.wmnet
[13:10:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1388.eqiad.wmnet
[13:11:14] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2319.codfw.wmnet with reason: host reimage
[13:11:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2318.codfw.wmnet with reason: host reimage
[13:11:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2356.codfw.wmnet with OS bullseye
[13:11:46] <wikibugs>	 (03PS2) 10Filippo Giunchedi: graphite: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831)
[13:11:48] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cassandra: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997814 (https://phabricator.wikimedia.org/T337831)
[13:11:50] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Fix the reuse-analytics-raid1-2dev recipe [puppet] - 10https://gerrit.wikimedia.org/r/997812 (https://phabricator.wikimedia.org/T349398) (owner: 10Btullis)
[13:11:56] <wikibugs>	 (03PS1) 10Filippo Giunchedi: confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831)
[13:12:00] <wikibugs>	 (03PS1) 10Filippo Giunchedi: chartmuseum: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831)
[13:12:04] <wikibugs>	 (03PS1) 10Filippo Giunchedi: docker_registry: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831)
[13:12:08] <wikibugs>	 (03PS1) 10Filippo Giunchedi: envoy: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997818 (https://phabricator.wikimedia.org/T337831)
[13:12:12] <wikibugs>	 (03PS1) 10Filippo Giunchedi: mediawiki: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831)
[13:12:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: etcd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831)
[13:12:20] <wikibugs>	 (03PS1) 10Filippo Giunchedi: mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831)
[13:13:30] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Allow pods in the dse k8s cluster to reach public-druid [puppet] - 10https://gerrit.wikimedia.org/r/997794 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol)
[13:13:48] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Allow pods in the dse k8s cluster to reach an-druid [puppet] - 10https://gerrit.wikimedia.org/r/997793 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol)
[13:13:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2319.codfw.wmnet with reason: host reimage
[13:14:07] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[13:14:12] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s)
[13:14:21] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] superset: configure extra TLS SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/997799 (https://phabricator.wikimedia.org/T356482) (owner: 10Brouberol)
[13:15:01] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "nit: In fact this role is applied to 3 hosts, an-coord100[1,3,4]." [puppet] - 10https://gerrit.wikimedia.org/r/997792 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol)
[13:15:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1392.eqiad.wmnet with OS bullseye
[13:15:37] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-web1001.eqiad.wmnet with OS bullseye
[13:16:37] <moritzm>	 !log pruning unneeded openjdk-17-jre-headless packages on restbase* hosts
[13:16:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:43] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1394.eqiad.wmnet with OS bullseye
[13:18:24] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:19] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2194.codfw.wmnet with OS bookworm
[13:19:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2350.codfw.wmnet with reason: host reimage
[13:20:22] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1396.eqiad.wmnet with OS bullseye
[13:20:26] <icinga-wm>	 RECOVERY - Disk space on build2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops
[13:20:56] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Allow pods in the dse k8s cluster to reach an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/997792 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol)
[13:20:57] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[13:21:03] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s)
[13:21:56] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided)
[13:22:09] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 12s)
[13:22:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2350.codfw.wmnet with reason: host reimage
[13:22:27] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2352.codfw.wmnet with reason: host reimage
[13:22:51] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1408.eqiad.wmnet with OS bullseye
[13:24:07] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2354.codfw.wmnet with reason: host reimage
[13:25:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2352.codfw.wmnet with reason: host reimage
[13:27:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2354.codfw.wmnet with reason: host reimage
[13:28:00] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2356.codfw.wmnet with reason: host reimage
[13:28:00] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2317.codfw.wmnet with OS bullseye
[13:29:00] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (codfw): (no justification provided)
[13:29:18] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (codfw): (no justification provided) (duration: 00m 17s)
[13:29:41] <wikibugs>	 (03PS1) 10Brouberol: Enable dse k8s workers -> presto https traffic [puppet] - 10https://gerrit.wikimedia.org/r/997846
[13:30:12] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Allow pods in the dse k8s cluster to reach an-druid [puppet] - 10https://gerrit.wikimedia.org/r/997793 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol)
[13:30:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2318.codfw.wmnet with OS bullseye
[13:30:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2356.codfw.wmnet with reason: host reimage
[13:32:49] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[13:32:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2319.codfw.wmnet with OS bullseye
[13:33:37] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Change LGTM, but I think will be cleaner with a minor hiera change, see inline." [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French)
[13:33:44] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-web1001.eqiad.wmnet with reason: host reimage
[13:34:13] <moritzm>	 !log installing openjdk-17 security updates
[13:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:18] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Allow pods in the dse k8s cluster to reach public-druid [puppet] - 10https://gerrit.wikimedia.org/r/997794 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol)
[13:36:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P56337 and previous config saved to /var/cache/conftool/dbconfig/20240206-133619-root.json
[13:36:30] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-web1001.eqiad.wmnet with reason: host reimage
[13:37:30] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[13:37:31] <wikibugs>	 (03PS2) 10Brouberol: Enable dse k8s workers -> presto https traffic [puppet] - 10https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623)
[13:37:38] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (codfw): Ensure that all codfw nodes are running the same revision
[13:37:46] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: host reimage
[13:38:11] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (codfw): Ensure that all codfw nodes are running the same revision (duration: 00m 32s)
[13:38:46] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): Ensure that all eqiad nodes are running the same revision
[13:39:17] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): Ensure that all eqiad nodes are running the same revision (duration: 00m 31s)
[13:39:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:39:37] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:40:31] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Idle - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:40:33] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: host reimage
[13:41:38] <vgutierrez>	 k8s@codfw BGP alert is expected?
[13:42:55] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:45:36] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2352.codfw.wmnet with OS bullseye
[13:45:43] <wikibugs>	 (03PS13) 10Brouberol: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis)
[13:47:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2354.codfw.wmnet with OS bullseye
[13:48:53] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] jaeger: route trace.w.o to jaeger-query [deployment-charts] - 10https://gerrit.wikimedia.org/r/997789 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi)
[13:50:04] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=config-master,name=codfw
[13:50:14] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2356.codfw.wmnet with OS bullseye
[13:51:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P56338 and previous config saved to /var/cache/conftool/dbconfig/20240206-135124-root.json
[13:51:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host config-master2001.codfw.wmnet
[13:55:27] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host config-master2001.codfw.wmnet
[13:56:28] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=config-master,name=codfw
[13:56:50] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=config-master,name=eqiad
[13:57:03] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=config-master,name=eqiad
[13:57:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host config-master1001.eqiad.wmnet
[13:59:44] <wikibugs>	 (PS3) Brouberol: Enable dse k8s workers -> presto https traffic [puppet] - https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623)
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1400).
[14:00:05] <jouncebot>	 Kizule and lucaswerkmeister: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:21] <lucaswerkmeister>	 o/
[14:00:41] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-web1001.eqiad.wmnet with OS bullseye
[14:01:05] <wikibugs>	 (CR) CI reject: [V: -1] Enable dse k8s workers -> presto https traffic [puppet] - https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) (owner: Brouberol)
[14:01:26] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2194.codfw.wmnet with OS bookworm
[14:01:47] <Lucas_WMDE>	 I can deploy
[14:01:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host config-master1001.eqiad.wmnet
[14:02:10] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=config-master,name=eqiad
[14:02:51] <wikibugs>	 (PS4) Brouberol: Enable dse k8s workers -> presto https traffic [puppet] - https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623)
[14:02:57] <Lucas_WMDE>	 oh, Kizule removed the maintenance script request apparently
[14:03:26] <Lucas_WMDE>	 ah, https://phabricator.wikimedia.org/T350431#9517284
[14:03:27] <Lucas_WMDE>	 :/
[14:04:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: Lucas Werkmeister)
[14:04:43] <lucaswerkmeister>	 I can reproduce T356505 at https://commons.wikimedia.org/w/index.php?title=File:CSD_Berlin_2019_-_Lucas_Werkmeister_-_24_-_Bi,_Pan,_Ace_Flags.jpg&action=submit
[14:04:44] <stashbot>	 T356505: File page edit preview does not load Filepage.css - https://phabricator.wikimedia.org/T356505
[14:04:47] <lucaswerkmeister>	 so I should be able to test the fix there
[14:05:10] <wikibugs>	 (CR) Btullis: [C: +1] Enable dse k8s workers -> presto https traffic [puppet] - https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) (owner: Brouberol)
[14:05:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::webserver
[14:05:13] <wikibugs>	 (CR) Stevemunene: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) (owner: Brouberol)
[14:05:35] <wikibugs>	 (CR) Brouberol: [C: +2] Enable dse k8s workers -> presto https traffic [puppet] - https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) (owner: Brouberol)
[14:05:45] <wikibugs>	 (CR) Marostegui: "I will take care of merging this myself - as I want to issue a puppet run on the masters right after merging to make sure nothing gets wei" [puppet] - https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[14:06:41] <anzx>	 Lucas_WMDE: is it required to run namespacedupes.php if all were accessible after deployment for T355662? and if namespacedupes.php needs to run for T349581, should I add it to the calendar?
[14:06:41] <stashbot>	 T355662: Create portal namespace on kannada wikipedia  - https://phabricator.wikimedia.org/T355662
[14:06:41] <stashbot>	 T349581: Create draft namespace and add namespaces aliases for hewikinews - https://phabricator.wikimedia.org/T349581
[14:07:15] <wikibugs>	 (PS1) Muehlenhoff: Switch an-web to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/997850 (https://phabricator.wikimedia.org/T349619)
[14:07:16] * Lucas_WMDE looks
[14:07:45] <Lucas_WMDE>	 I probably should’ve run it there and forgot, yeah
[14:07:47] <Lucas_WMDE>	 let me check now
[14:08:05] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Infrastructure-Foundations, netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (Jhancock.wm) This rack is physically ready for tomorrow.
[14:08:22] <Lucas_WMDE>	 oh yeah knwiki has plenty of links to fix apparently
[14:08:35] <Lucas_WMDE>	 (no pages to fix, but would still be nice to fix the links)
[14:08:52] <Lucas_WMDE>	 likewise hewikinews (though with fewer links to fix)
[14:09:27] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes knwiki --fix # T355662 (crashed)
[14:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:36] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/997468 (https://phabricator.wikimedia.org/T355172) (owner: Slyngshede)
[14:09:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch an-web to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/997850 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:10:48] <Lucas_WMDE>	 anzx: looks like the maintenance script needs to be fixed first
[14:11:58] <wikibugs>	 (CR) Slyngshede: profile: remove Icinga-based systemd unit failed check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: Filippo Giunchedi)
[14:12:09] <wikibugs>	 (CR) Slyngshede: [C: -1] profile: remove Icinga-based systemd unit failed check [puppet] - https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: Filippo Giunchedi)
[14:12:28] <Lucas_WMDE>	 ugh, and one of the gate-and-submits for the backport failed with ECONNRESET in npm
[14:12:49] <anzx>	 Lucas_WMDE: will ask again in a few days, thanks
[14:13:04] <Lucas_WMDE>	 sounds good, thanks!
[14:13:05] <wikibugs>	 (CR) CI reject: [V: -1] Load Filepage.css when previewing File pages [core] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: Lucas Werkmeister)
[14:13:16] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: Lucas Werkmeister)
[14:13:19] <Lucas_WMDE>	 let’s try that again…
[14:14:25] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T356726 (Jhancock.wm) Open→Resolved a:Jhancock.wm known issue with no impact
[14:14:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::webserver
[14:16:15] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:16:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-web1001.eqiad.wmnet
[14:16:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: Lucas Werkmeister)
[14:16:53] <Lucas_WMDE>	 (not sure why `scap backport` exited early there while the build was still ongoing… I started it again now)
[14:17:50] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[14:19:45] <wikibugs>	 (PS1) Slyngshede: Allow users to view the entire SSH key [software/bitu] - https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140)
[14:20:46] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Password reset: Allow signed in users to navigate. [software/bitu] - https://gerrit.wikimedia.org/r/997506 (https://phabricator.wikimedia.org/T355907) (owner: Slyngshede)
[14:21:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P56340 and previous config saved to /var/cache/conftool/dbconfig/20240206-142134-root.json
[14:22:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-web1001.eqiad.wmnet
[14:22:35] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] CI Fix broken tests. [software/bitu] - https://gerrit.wikimedia.org/r/997468 (https://phabricator.wikimedia.org/T355172) (owner: Slyngshede)
[14:26:53] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (Jhancock.wm) @BTullis I can reseat the backplane to try and fix this. Is it safe for me to do so? or are you currently working...
[14:28:19] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (BTullis) >>! In T355830#9517383, @Jhancock.wm wrote: > @BTullis I can reseat the backplane to try and fix this. Is it safe for...
[14:32:31] <Emperor>	 !log debug convert-disks cookbook against out-of-use ms-be2044 T308677
[14:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:35] <stashbot>	 T308677: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677
[14:32:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2044
[14:33:00] <wikibugs>	 (Merged) jenkins-bot: Load Filepage.css when previewing File pages [core] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: Lucas Werkmeister)
[14:33:13] <Lucas_WMDE>	 finally
[14:33:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:997779|Load Filepage.css when previewing File pages (T356505)]]
[14:33:32] <stashbot>	 T356505: File page edit preview does not load Filepage.css - https://phabricator.wikimedia.org/T356505
[14:34:40] <wikibugs>	 (CR) Filippo Giunchedi: "SGTM, thank you Manuel for the quick review!" [puppet] - https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[14:35:10] <wikibugs>	 (CR) Muehlenhoff: Allow users to view the entire SSH key (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140) (owner: Slyngshede)
[14:35:27] <Lucas_WMDE>	 7m, apparently 14 k8s nodes are taking longer to docker pull the new image
[14:35:29] <Lucas_WMDE>	 *hm
[14:36:16] <wikibugs>	 (PS1) Brouberol: service: register superset and superset-next under ingress [puppet] - https://gerrit.wikimedia.org/r/997857 (https://phabricator.wikimedia.org/T356483)
[14:36:23] <Lucas_WMDE>	 ok, that finished now
[14:36:28] <wikibugs>	 (PS2) Brouberol: Add superset/superset-next.svc.eqiad.wmnet records [dns] - https://gerrit.wikimedia.org/r/995174 (https://phabricator.wikimedia.org/T356481)
[14:36:36] <wikibugs>	 (PS1) Brouberol: superset: setup dyna mapping rules [puppet] - https://gerrit.wikimedia.org/r/997858 (https://phabricator.wikimedia.org/T356481)
[14:36:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P56341 and previous config saved to /var/cache/conftool/dbconfig/20240206-143639-root.json
[14:36:40] <wikibugs>	 (PS1) Brouberol: Superset: setup temporary external domains for the k8s deployments [dns] - https://gerrit.wikimedia.org/r/997859 (https://phabricator.wikimedia.org/T356482)
[14:36:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and lucaswerkmeister: Backport for [[gerrit:997779|Load Filepage.css when previewing File pages (T356505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:37:20] <lucaswerkmeister>	 seems to work fine \o/
[14:37:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and lucaswerkmeister: Continuing with sync
[14:38:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1009.wikimedia.org
[14:39:11] <wikibugs>	 (PS2) Brouberol: service: register superset and superset-next under ingress [puppet] - https://gerrit.wikimedia.org/r/997857 (https://phabricator.wikimedia.org/T356483)
[14:39:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:42:05] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (Jhancock.wm) @BTullis looks like it worked. But since that backplane error occurred twice already, if it happens again lmk and...
[14:44:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:997779|Load Filepage.css when previewing File pages (T356505)]] (duration: 10m 51s)
[14:44:18] <stashbot>	 T356505: File page edit preview does not load Filepage.css - https://phabricator.wikimedia.org/T356505
[14:44:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) prometheus-phpfpm-statustext-textfile.service Failed on mw1371:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:33] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (BTullis) >>! In T355830#9517443, @Jhancock.wm wrote: > @BTullis looks like it worked. But since that backplane error occurred...
[14:44:58] <lucaswerkmeister>	 now it’s also working without mwdebug, as far as I can tell \o/
[14:45:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[14:45:25] <Lucas_WMDE>	 ok, then I think we’re done!
[14:45:31] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:41] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw
[14:48:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1009.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[14:49:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1009.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[14:49:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:49:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1009.wikimedia.org
[14:49:26] <jinxer-wm>	 (SystemdUnitFailed) resolved: (40) prometheus-phpfpm-statustext-textfile.service Failed on mw1357:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:33] <wikibugs>	 (PS1) Muehlenhoff: New cookbook to reboot/restart config-master hosts [cookbooks] - https://gerrit.wikimedia.org/r/997887
[14:50:48] <jinxer-wm>	 (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:51:13] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2044
[14:51:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P56343 and previous config saved to /var/cache/conftool/dbconfig/20240206-145144-root.json
[14:51:51] <wikibugs>	 (PS1) Muehlenhoff: Extend config-master Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/997888
[14:52:28] <wikibugs>	 (CR) Eevans: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/997814 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[14:52:33] <wikibugs>	 (PS1) Hashar: wm-checks-api: handle Zuul 'Merge failed' messages [software/gerrit] (deploy/wmf/stable-3.7) - https://gerrit.wikimedia.org/r/997889 (https://phabricator.wikimedia.org/T356647)
[14:52:58] <wikibugs>	 (CR) Hashar: [C: +2] wm-checks-api: handle Zuul 'Merge failed' messages [software/gerrit] (deploy/wmf/stable-3.7) - https://gerrit.wikimedia.org/r/997889 (https://phabricator.wikimedia.org/T356647) (owner: Hashar)
[14:53:31] <wikibugs>	 (PS2) Filippo Giunchedi: profile: remove absented statsd hosts entry [puppet] - https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862)
[14:53:35] <wikibugs>	 (PS2) Filippo Giunchedi: profile: remove Icinga-based systemd unit failed check [puppet] - https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764)
[14:53:41] <wikibugs>	 (Merged) jenkins-bot: wm-checks-api: handle Zuul 'Merge failed' messages [software/gerrit] (deploy/wmf/stable-3.7) - https://gerrit.wikimedia.org/r/997889 (https://phabricator.wikimedia.org/T356647) (owner: Hashar)
[14:54:24] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@2e441ac]: wm-checks-api: handle Zuul 'Merge failed' messages - T356647
[14:54:30] <stashbot>	 T356647: wmf-checks-api: Gerrit checks display lists "merge failed" as success - https://phabricator.wikimedia.org/T356647
[14:54:31] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@2e441ac]: wm-checks-api: handle Zuul 'Merge failed' messages - T356647 (duration: 00m 07s)
[14:54:32] <wikibugs>	 (CR) Filippo Giunchedi: profile: remove Icinga-based systemd unit failed check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: Filippo Giunchedi)
[14:54:40] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye
[14:56:37] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[14:59:33] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:52] <wikibugs>	 (PS2) Filippo Giunchedi: cassandra: remove nrpe::monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/997814 (https://phabricator.wikimedia.org/T337831)
[15:02:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] cassandra: remove nrpe::monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/997814 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[15:03:11] <wikibugs>	 (CR) MVernon: "Hi!" [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[15:04:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] jaeger: route trace.w.o to jaeger-query [deployment-charts] - https://gerrit.wikimedia.org/r/997789 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[15:06:29] <wikibugs>	 (CR) Jforrester: [C: +1] libraryupgrader: use system docker on newer Debian versions [puppet] - https://gerrit.wikimedia.org/r/997548 (owner: Majavah)
[15:06:37] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on netbox1002.eqiad.wmnet with reason: Restoring DB from backup on netboxdb1002
[15:06:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P56344 and previous config saved to /var/cache/conftool/dbconfig/20240206-150649-root.json
[15:06:54] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on netbox1002.eqiad.wmnet with reason: Restoring DB from backup on netboxdb1002
[15:07:50] <topranks>	 !log Disabling netbox service on netbox1002 prior to db restore from backup 
[15:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:41] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw
[15:14:10] <logmsgbot>	 !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[15:14:24] <logmsgbot>	 !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[15:15:41] <icinga-wm>	 PROBLEM - netbox Postgres on netboxdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB netbox (host:localhost) 22039184 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:16:48] <wikibugs>	 (PS1) Clément Goubert: prometheus-apache-exporter: Bump version to 0.0.4 [puppet] - https://gerrit.wikimedia.org/r/997894 (https://phabricator.wikimedia.org/T283861)
[15:16:55] <icinga-wm>	 RECOVERY - netbox Postgres on netboxdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB netbox (host:localhost) 0 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:17:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:20:02] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1284/co" [puppet] - https://gerrit.wikimedia.org/r/997894 (https://phabricator.wikimedia.org/T283861) (owner: Clément Goubert)
[15:23:25] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2094.codfw.wmnet with reason: host reimage
[15:25:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860
[15:25:50] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860
[15:25:54] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[15:26:22] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2094.codfw.wmnet with reason: host reimage
[15:26:33] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: name=wdqs2016.codfw.wmnet
[15:27:11] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] P:docker::builder clean docker image cache regularly. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/997796 (owner: Slyngshede)
[15:27:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860
[15:27:40] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860
[15:28:42] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[15:29:16] <wikibugs>	 (PS1) Clément Goubert: prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861)
[15:34:17] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning all hosts in row B4 for switch maintenance - bking@cumin2002 - T355860
[15:34:17] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning all hosts in row B4 for switch maintenance - bking@cumin2002 - T355860
[15:34:21] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[15:37:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860
[15:37:19] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860
[15:37:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:41:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: eelastic2058,elastic2070,elastic2095,elastic2096 for switch maintenance - bking@cumin2002 - T355860
[15:41:52] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: eelastic2058,elastic2070,elastic2095,elastic2096 for switch maintenance - bking@cumin2002 - T355860
[15:41:55] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[15:41:57] <logmsgbot>	 !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/datahub: sync on main
[15:42:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2058,elastic2070,elastic2095,elastic2096 for switch maintenance - bking@cumin2002 - T355860
[15:42:36] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2058,elastic2070,elastic2095,elastic2096 for switch maintenance - bking@cumin2002 - T355860
[15:43:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2058 for switch maintenance - bking@cumin2002 - T355860
[15:43:18] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2058 for switch maintenance - bking@cumin2002 - T355860
[15:43:28] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [restbase/deploy@05fa5c9]: Disabling storage for ptwiki
[15:43:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2058* for switch maintenance - bking@cumin2002 - T355860
[15:43:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2058* for switch maintenance - bking@cumin2002 - T355860
[15:44:00] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2094.codfw.wmnet with OS bullseye
[15:44:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2058*,elastic2070*,elastic2095*,elastic2096* for switch maintenance - bking@cumin2002 - T355860
[15:44:09] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2094.codfw.wmnet with OS...
[15:44:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2058*,elastic2070*,elastic2095*,elastic2096* for switch maintenance - bking@cumin2002 - T355860
[15:44:29] <jinxer-wm>	 (ProbeDown) firing: (2) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:46:38] <wikibugs>	 (CR) Clément Goubert: [C: +1] mediawiki: remove nrpe::monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[15:46:40] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (BTullis) Open→Resolved a:BTullis Thanks @Jhancock.wm - The reimage cookbook hung once at PXE boot, but I gave it a...
[15:46:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[15:47:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[15:47:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:47:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[15:48:25] <jinxer-wm>	 (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1009.wikimedia.org
[15:49:24] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2033-2034].codfw.wmnet with reason: T355860
[15:50:55] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[15:51:09] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2033-2034].codfw.wmnet with reason: T355860
[15:51:17] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, Beta-Cluster-Infrastructure, DBA, MediaWiki-libs-Rdbms, Epic: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (TK-999)
[15:51:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:52:34] <logmsgbot>	 !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[15:53:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:53:21] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudelastic1009.wikimedia.org
[15:54:29] <jinxer-wm>	 (ProbeDown) firing: (2) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:55:54] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:56:33] <topranks>	 !log moving Netbox server uplinks from asw-b4-codfw to lsw1-b4-codfw to prep config for server moves T355860
[15:56:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:40] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[15:56:47] <wikibugs>	 (CR) Clément Goubert: "Removing vote until I understand our systemd monitoring a little better" [puppet] - https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[15:57:54] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:23] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-b-codfw,lsw1-b4-codfw.mgmt with reason: prepping for server uplink migration
[15:58:39] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-b-codfw,lsw1-b4-codfw.mgmt with reason: prepping for server uplink migration
[15:58:47] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1cb41722-6e24-4871-a903-cdb117a03449) set by cmooney...
[15:58:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:58:55] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cr[1-2]-codfw with reason: prepping for server uplink migration
[15:58:57] <wikibugs>	 SRE, Machine-Learning-Team, Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (isarantopoulos) I tried to delete a revision and an inferenceservice on experimental namespace and it seems that I don't have access:  ` kubectl delete re...
[15:59:10] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr[1-2]-codfw with reason: prepping for server uplink migration
[15:59:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:59:17] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b2349fc0-73a1-418a-b3b8-284c8a40d573) set by cmooney...
[15:59:29] <jinxer-wm>	 (ProbeDown) firing: (2) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:00:05] <jouncebot>	 eoghan, jelto, and arnoldokoth: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1600).
[16:00:09] <topranks>	 !log configuring lsw1-b4-codfw with port config for new hosts T355860
[16:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:08] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@05fa5c9]: Disabling storage for ptwiki (duration: 17m 39s)
[16:01:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:01:26] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:02:12] <icinga-wm>	 PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:02:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:02:41] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw
[16:02:57] <jinxer-wm>	 (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:03:04] <icinga-wm>	 RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:03:13] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw
[16:03:13] <vgutierrez>	 uh
[16:03:18] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:03:20] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a3c16d29-3284-4390-9f38-033ef67e36ff) set by cmooney...
[16:03:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:04:29] <jinxer-wm>	 (ProbeDown) firing: (6) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:04:31] <wikibugs>	 (CR) Brouberol: Add a deployment chart for Superset (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: Btullis)
[16:04:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1009 to private IPs - bking@cumin2002"
[16:05:15] <topranks>	 !log Commencing server uplink moves from old switch to new in codfw rack B4 T355860
[16:05:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:19] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[16:05:25] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/997901
[16:05:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1009 to private IPs - bking@cumin2002"
[16:05:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:07:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1009
[16:07:57] <jinxer-wm>	 (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:08:18] <wikibugs>	 (CR) Brouberol: Add a deployment chart for Superset (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: Btullis)
[16:08:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1009
[16:10:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[16:10:37] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[16:10:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T355609)', diff saved to https://phabricator.wikimedia.org/P56347 and previous config saved to /var/cache/conftool/dbconfig/20240206-161043-marostegui.json
[16:10:47] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[16:10:49] <topranks>	 !log Hosts migrated and basic connectivity ok codfw rack B4 T355860
[16:10:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:54] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[16:12:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:12:58] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:13:14] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (cmooney) All hosts moved successfully, all now responding to pings fine and MAC forwarding tables look correct.
[16:13:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:15:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:17:51] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:18:17] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw
[16:18:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw
[16:18:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T355609)', diff saved to https://phabricator.wikimedia.org/P56348 and previous config saved to /var/cache/conftool/dbconfig/20240206-161849-marostegui.json
[16:18:53] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[16:20:48] <wikibugs>	 (CR) AikoChou: [C: +1] ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/997901 (owner: Ilias Sarantopoulos)
[16:21:38] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/997901 (owner: Ilias Sarantopoulos)
[16:23:00] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:31] <wikibugs>	 (PS1) Brouberol: ferm: fix typo in the public druid ferm_srange rule [puppet] - https://gerrit.wikimedia.org/r/997906
[16:24:37] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/997901 (owner: Ilias Sarantopoulos)
[16:25:30] <wikibugs>	 (Merged) jenkins-bot: ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/997901 (owner: Ilias Sarantopoulos)
[16:25:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[16:25:59] <wikibugs>	 (CR) Stevemunene: [C: +1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/997906 (owner: Brouberol)
[16:26:26] <wikibugs>	 (CR) FNegri: [C: +1] Provide a standalone bookworm-web container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) (owner: Majavah)
[16:26:30] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:26:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2087*,elastic2037*,elastic2038*,elastic2055*,elastic2088*,elastic2073*,elastic2074* for switch maintenance - bking@cumin2002 - T355860
[16:26:43] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[16:26:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2087*,elastic2037*,elastic2038*,elastic2055*,elastic2088*,elastic2073*,elastic2074* for switch maintenance - bking@cumin2002 - T355860
[16:26:46] <wikibugs>	 (CR) Brouberol: [C: +2] ferm: fix typo in the public druid ferm_srange rule [puppet] - https://gerrit.wikimedia.org/r/997906 (owner: Brouberol)
[16:26:59] <Daimona>	 !log T353459 Running mwscript CampaignEvents:GenerateInvitationList --wiki=metawiki --listfile=/home/daimona/list.txt
[16:27:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:03] <stashbot>	 T353459: Develop a prototype for Event Invitations with scoring on likelihood of valuable participation - https://phabricator.wikimedia.org/T353459
[16:27:10] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:52] <claime>	 erm looks like we got a problem with mw-on-k8s
[16:29:10] <claime>	 Trying to find out what
[16:29:28] <claime>	 I can curl but I get random timeouts with httpbb
[16:29:30] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp[2033-2034].codfw.wmnet
[16:29:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp[2033-2034].codfw.wmnet
[16:29:35] <claime>	 Probably some pods in a bad state
[16:29:41] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad
[16:29:50] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=wdqs2016.codfw.wmnet
[16:30:19] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=(cdn|ats-be)
[16:30:20] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=(cdn|ats-be)
[16:30:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[16:33:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56349 and previous config saved to /var/cache/conftool/dbconfig/20240206-163355-marostegui.json
[16:34:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cloudelastic1009.mgmt.eqiad.wmnet on all recursors
[16:34:10] <wikibugs>	 (CR) JMeybohm: [C: -2] "Think we need to hold back because of https://phabricator.wikimedia.org/T356787 - these hosts are still buster." [puppet] - https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[16:34:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudelastic1009.mgmt.eqiad.wmnet on all recursors
[16:34:27] <wikibugs>	 (CR) JMeybohm: [C: -2] "Think we need to hold back because of https://phabricator.wikimedia.org/T356787 - these hosts are still buster." [puppet] - https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[16:34:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:35:16] <claime>	 !log Roll-restarting mw-api-ext deployment in codfw
[16:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:38] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[16:35:42] <wikibugs>	 (CR) JMeybohm: [C: +1] "I think this is fine as the resource was absent anyways" [puppet] - https://gerrit.wikimedia.org/r/997818 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[16:35:59] <wikibugs>	 SRE, observability, Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (MatthewVernon)
[16:36:44] <_joe_>	 jouncebot: nowandnext
[16:36:44] <jouncebot>	 For the next 0 hour(s) and 23 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1600)
[16:36:44] <jouncebot>	 In 0 hour(s) and 23 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1700)
[16:38:07] <wikibugs>	 (CR) Scott French: "Thanks again for the review. Also that's great - I'd not seen `Hosts: auto` before." [puppet] - https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: Scott French)
[16:38:37] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[16:38:40] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:38:48] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:39:12] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:39:51] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:40:48] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:40:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:40:56] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:41:24] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:41:36] <wikibugs>	 (CR) MVernon: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: Eevans)
[16:42:38] <wikibugs>	 (CR) JMeybohm: [C: -1] etcd: remove nrpe::monitor_systemd_unit_state (1 comment) [puppet] - https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[16:43:09] <wikibugs>	 SRE, observability, Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (jcrespo) Potentially a similar issue (request/traffic related high load) happened around the past 28 of September, when I added this TOD...
[16:45:06] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:47:06] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:49:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56352 and previous config saved to /var/cache/conftool/dbconfig/20240206-164902-marostegui.json
[16:49:14] <wikibugs>	 (CR) MVernon: "I think the change to spare has to be wrong (as the role doesn't exist)." [puppet] - https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[16:50:05] <wikibugs>	 (CR) MVernon: "Sorry, having seen the other change and PCC error, I think that problem must fit here as well - there isn't a spare::system role that I ca" [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: Eevans)
[16:51:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage
[16:53:19] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/997888 (owner: Muehlenhoff)
[16:53:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:53:43] <wikibugs>	 (PS21) Brouberol: Add a deployment chart for Superset [deployment-charts] - https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: Btullis)
[16:54:09] <wikibugs>	 SRE-OnFire, Incident Tooling: Corto: internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (lmata)
[16:54:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage
[16:54:37] <wikibugs>	 SRE-OnFire, Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (lmata)
[16:54:44] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad
[16:54:48] <wikibugs>	 SRE-OnFire, Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (jhathaway)
[16:55:17] <wikibugs>	 SRE-OnFire, Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (jhathaway)
[16:55:53] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:56:22] <wikibugs>	 SRE-OnFire, Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (jhathaway)
[16:56:55] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (MatthewVernon) swift backends look happy, thanks :)
[16:57:33] <wikibugs>	 (PS3) Eevans: sessionstore: remove EOL hosts [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405)
[16:58:15] <wikibugs>	 SRE-OnFire, Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (fgiunchedi)
[16:58:23] <wikibugs>	 (PS3) Eevans: Remove EOL restbase hosts [puppet] - https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469)
[16:59:53] <wikibugs>	 (CR) Brouberol: [C: +1] Bring two new stat servers into service [puppet] - https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040) (owner: Btullis)
[17:00:05] <jouncebot>	 jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1700). Please do the needful.
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:01:07] <wikibugs>	 (PS4) JHathaway: rsyslog: have rsyslog create its own files [puppet] - https://gerrit.wikimedia.org/r/997555
[17:03:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:03:55] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:04:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T355609)', diff saved to https://phabricator.wikimedia.org/P56353 and previous config saved to /var/cache/conftool/dbconfig/20240206-170408-marostegui.json
[17:04:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[17:04:19] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[17:04:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[17:04:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T355609)', diff saved to https://phabricator.wikimedia.org/P56354 and previous config saved to /var/cache/conftool/dbconfig/20240206-170431-marostegui.json
[17:05:23] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:06:08] <wikibugs>	 (CR) MVernon: [C: +1] sessionstore: remove EOL hosts [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: Eevans)
[17:06:45] <wikibugs>	 (CR) MVernon: [C: +1] Remove EOL restbase hosts [puppet] - https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[17:06:58] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: Scott French)
[17:08:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:08:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:08:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.codfw.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:10:33] <wikibugs>	 (CR) BryanDavis: Provide context for account creation. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: Slyngshede)
[17:11:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002"
[17:12:37] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:12:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T355609)', diff saved to https://phabricator.wikimedia.org/P56355 and previous config saved to /var/cache/conftool/dbconfig/20240206-171240-marostegui.json
[17:12:45] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[17:13:41] <wikibugs>	 (PS4) Eevans: sessionstore: remove EOL hosts [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405)
[17:13:47] <icinga-wm>	 RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[17:13:51] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: (2) Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:15:39] <wikibugs>	 (PS1) Andrew Bogott: rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - https://gerrit.wikimedia.org/r/997921
[17:16:27] <wikibugs>	 (PS2) Andrew Bogott: rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - https://gerrit.wikimedia.org/r/997921
[17:17:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:17:37] <wikibugs>	 (CR) David Caro: [C: +1] rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - https://gerrit.wikimedia.org/r/997921 (owner: Andrew Bogott)
[17:18:06] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, open question for the path" [cookbooks] - https://gerrit.wikimedia.org/r/997887 (owner: Muehlenhoff)
[17:18:40] <wikibugs>	 (PS3) Andrew Bogott: rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - https://gerrit.wikimedia.org/r/997921 (https://phabricator.wikimedia.org/T356621)
[17:19:13] <wikibugs>	 (CR) Eevans: [C: +2] sessionstore: remove EOL hosts [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: Eevans)
[17:22:46] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts sessionstore[1001-1003].eqiad.wmnet
[17:24:21] <wikibugs>	 (CR) Andrew Bogott: [C: +2] rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - https://gerrit.wikimedia.org/r/997921 (https://phabricator.wikimedia.org/T356621) (owner: Andrew Bogott)
[17:25:38] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: dc=codfw,service=kubesvc,name=mw.*
[17:26:31] <_joe_>	 ok, now pooling them
[17:26:35] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,service=kubesvc,name=mw.*
[17:26:52] <claime>	 doing the same on eqiad
[17:27:31] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/weight=10; selector: name=mw.*,dc=eqiad,cluster=kubernetes,service=kubesvc
[17:27:39] <claime>	 pooling now
[17:27:44] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (Eevans) Open→Resolved The RAID has been rebuilt, let's hope 3rd time is the charm!
[17:27:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P56356 and previous config saved to /var/cache/conftool/dbconfig/20240206-172747-marostegui.json
[17:27:55] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=mw.*,dc=eqiad,cluster=kubernetes,service=kubesvc
[17:30:29] <wikibugs>	 (PS1) Eevans: site.pp: remove EOL sessionstore hosts [puppet] - https://gerrit.wikimedia.org/r/997928 (https://phabricator.wikimedia.org/T353405)
[17:31:01] <wikibugs>	 (PS1) Giuseppe Lavagetto: Do not add env variables when they're empty [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997873 (https://phabricator.wikimedia.org/T356780)
[17:33:30] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.dns.netbox
[17:35:27] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sessionstore[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[17:35:59] <wikibugs>	 SRE, Wikimedia-Etherpad, collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (akosiaris) Let's see how I can be of help.  > what branch is used to build the package  It's configurable in gbp, but the default workflow assumes that the code from...
[17:36:40] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sessionstore[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[17:36:40] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:36:40] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sessionstore[1001-1003].eqiad.wmnet
[17:37:46] <claime>	 !log rebooting kubernetes2010.codfw.wmnet
[17:37:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:32] <wikibugs>	 (CR) Eevans: [C: +2] site.pp: remove EOL sessionstore hosts [puppet] - https://gerrit.wikimedia.org/r/997928 (https://phabricator.wikimedia.org/T353405) (owner: Eevans)
[17:42:13] <wikibugs>	 ops-eqiad, Cassandra, decommission-hardware: Decommission sessionstore100[1-3] - https://phabricator.wikimedia.org/T356719 (Eevans) a:Eevans→None
[17:42:40] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2073-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:42:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P56357 and previous config saved to /var/cache/conftool/dbconfig/20240206-174253-marostegui.json
[17:43:14] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reboot-single for host kubernetes2010.codfw.wmnet
[17:43:50] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:44:10] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:52] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bullseye
[17:48:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:51:49] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2010.codfw.wmnet
[17:52:52] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:52:58] <wikibugs>	 (PS12) Bking: sre.hosts.reimage: Suggest install-console for troubleshooting [cookbooks] - https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778)
[17:53:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:56:40] <kamila_>	 !log wikikube: cordon nodes added earlier today in codfw
[17:56:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:55] <wikibugs>	 (PS1) Bking: cloudelastic: Complete cloudelastic1009's migration [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617)
[17:57:12] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[17:57:52] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:58:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T355609)', diff saved to https://phabricator.wikimedia.org/P56358 and previous config saved to /var/cache/conftool/dbconfig/20240206-175800-marostegui.json
[17:58:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[17:58:05] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[17:58:07] <claime>	 !log uncordoning kubernetes2010
[17:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[17:58:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T355609)', diff saved to https://phabricator.wikimedia.org/P56359 and previous config saved to /var/cache/conftool/dbconfig/20240206-175822-marostegui.json
[17:59:06] <kamila_>	 !log wikikube codfw: drain newly added nodes
[17:59:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:29] <jinxer-wm>	 (ProbeDown) firing: (2) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1800)
[18:00:48] <wikibugs>	 (PS2) Bking: cloudelastic: Complete cloudelastic1009's migration [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617)
[18:00:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:01:19] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[18:01:42] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:41] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[18:03:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:05:16] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:05:20] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:05:51] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:06:00] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[18:06:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T355609)', diff saved to https://phabricator.wikimedia.org/P56360 and previous config saved to /var/cache/conftool/dbconfig/20240206-180641-marostegui.json
[18:06:45] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[18:09:11] <wikibugs>	 (PS1) Btullis: [DPE Postgres] Only backup the latest postgres dump file [puppet] - https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655)
[18:09:37] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997873 (https://phabricator.wikimedia.org/T356780) (owner: Giuseppe Lavagetto)
[18:12:00] <wikibugs>	 (CR) Bking: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1285/console" [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[18:12:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (3) Elasticsearch instance elastic2037-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:13:16] <kamila_>	 !log wikikube codfw: belated homer commit of new nodes
[18:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:34] <inflatador>	 :eyes on elastic alert above
[18:14:44] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:17:35] <wikibugs>	 (PS4) Eevans: Remove EOL restbase hosts [puppet] - https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469)
[18:17:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2037-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:18:53] <wikibugs>	 (CR) Brouberol: [C: +1] "Looks good 👍" [puppet] - https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655) (owner: Btullis)
[18:20:00] <wikibugs>	 (CR) Eevans: [C: +2] Remove EOL restbase hosts [puppet] - https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[18:20:53] <kamila_>	 !log wikikube codfw: uncordon new nodes
[18:20:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P56362 and previous config saved to /var/cache/conftool/dbconfig/20240206-182148-marostegui.json
[18:22:28] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts restbase[2013-2020].codfw.wmnet
[18:24:06] <wikibugs>	 (PS1) Eevans: site.pp: remove decommissioned restbase hosts [puppet] - https://gerrit.wikimedia.org/r/997941 (https://phabricator.wikimedia.org/T352469)
[18:27:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 7 hosts with reason: T355860
[18:27:19] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[18:27:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 7 hosts with reason: T355860
[18:28:06] <wikibugs>	 (PS3) Scott French: systemd::unit: clean up ownership file [puppet] - https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054)
[18:28:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:29:24] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:30:32] <wikibugs>	 (Merged) jenkins-bot: Do not add env variables when they're empty [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997873 (https://phabricator.wikimedia.org/T356780) (owner: Giuseppe Lavagetto)
[18:30:56] <logmsgbot>	 !log oblivian@deploy2002 Started scap: Backport for [[gerrit:997873|Do not add env variables when they're empty (T356780)]]
[18:31:01] <stashbot>	 T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[18:31:46] <wikibugs>	 (CR) Bking: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1286/console" [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[18:32:38] <logmsgbot>	 !log oblivian@deploy2002 oblivian: Backport for [[gerrit:997873|Do not add env variables when they're empty (T356780)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:36:00] <logmsgbot>	 !log oblivian@deploy2002 oblivian: Continuing with sync
[18:36:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P56363 and previous config saved to /var/cache/conftool/dbconfig/20240206-183654-marostegui.json
[18:38:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:42:12] <_joe_>	 uhm
[18:42:16] <_joe_>	 still firing
[18:42:52] <wikibugs>	 (CR) Scott French: "Thanks for taking a look, Moritz. I was actually going to add you as a reviewer, as I saw you originally reviewed [0]." [puppet] - https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) (owner: Scott French)
[18:42:53] <logmsgbot>	 !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:997873|Do not add env variables when they're empty (T356780)]] (duration: 11m 57s)
[18:43:09] <stashbot>	 T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[18:43:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (25) httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:43:40] <jinxer-wm>	 (SystemdUnitFailed) firing: (25) httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:44:10] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:44:26] <wikibugs>	 (CR) BCornwall: [V: +1] fifo-log-demux: Decouple service from nginx/ats (2 comments) [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[18:45:42] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.dns.netbox
[18:46:56] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:47:47] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[2013-2020].codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[18:48:26] <jinxer-wm>	 (SystemdUnitFailed) resolved: (45) httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:48:50] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[2013-2020].codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[18:48:50] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:48:51] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase[2013-2020].codfw.wmnet
[18:49:53] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862) (owner: Filippo Giunchedi)
[18:50:22] <wikibugs>	 (CR) Eevans: [C: +2] site.pp: remove decommissioned restbase hosts [puppet] - https://gerrit.wikimedia.org/r/997941 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[18:50:49] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[18:52:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T355609)', diff saved to https://phabricator.wikimedia.org/P56364 and previous config saved to /var/cache/conftool/dbconfig/20240206-185201-marostegui.json
[18:52:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[18:52:06] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[18:52:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[18:52:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T355609)', diff saved to https://phabricator.wikimedia.org/P56365 and previous config saved to /var/cache/conftool/dbconfig/20240206-185223-marostegui.json
[18:52:31] <wikibugs>	 ops-codfw, Cassandra, decommission-hardware: decommission restbase20[13-20] - https://phabricator.wikimedia.org/T356695 (Eevans)
[19:00:05] <jouncebot>	 brennen and dancy: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1900).
[19:00:19] <brennen>	 o/
[19:00:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T355609)', diff saved to https://phabricator.wikimedia.org/P56366 and previous config saved to /var/cache/conftool/dbconfig/20240206-190037-marostegui.json
[19:00:50] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[19:06:51] <brennen>	 !log train 1.42.0-wmf.17: considering unblocked for group0, rolling forward.
[19:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:33] <wikibugs>	 (PS1) TrainBranchBot: group0 wikis to 1.42.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/997945 (https://phabricator.wikimedia.org/T354435)
[19:07:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group0 wikis to 1.42.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/997945 (https://phabricator.wikimedia.org/T354435) (owner: TrainBranchBot)
[19:08:26] <wikibugs>	 (CR) Bking: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1289/console" [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[19:10:58] <dancy>	 o/
[19:13:07] <dancy>	 Many `.17 e/C/i/J/JobTraits:92  Received cirrusSearchElasticaWrite job for an unwritable cluster cloudelastic.`
[19:13:13] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.42.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/997945 (https://phabricator.wikimedia.org/T354435) (owner: TrainBranchBot)
[19:13:14] <dancy>	 brennen:
[19:13:26] <dancy>	 although they're trailing off.
[19:13:53] <inflatador>	 dancy what was the timeline for those errors? We've been migrating cloudelastic to private IPs
[19:13:53] <dancy>	 and now gone. :-)
[19:14:07] <dancy>	 I was looking at last 15 minutes..  and then they went away
[19:14:41] <dancy>	 inflatador: Thanks for the info!
[19:14:54] <dancy>	 Crisis averted. :-)
[19:15:15] <inflatador>	 sure, more context in T355617 if interested. No impact expected, but that cluster doesn't have a lot of redundancy ;(
[19:15:16] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[19:15:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P56367 and previous config saved to /var/cache/conftool/dbconfig/20240206-191544-marostegui.json
[19:16:32] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 400 bytes in 0.716 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:17:17] <brennen>	 dancy: thanks for ping.  i'd seen that spike and assumed it was likely something transient but didn't dig in.
[19:17:46] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1021 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:18:33] <inflatador>	 dancy if you can give a timeline (or a place to look for a timeline) I can re-queue the writes that failed
[19:18:55] <dancy>	 inflatador: Sure. Stand by.
[19:20:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) prometheus-phpfpm-statustext-textfile.service Failed on mw1406:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:20:29] <brennen>	 also now seeing a bunch of "Received cirrusSearchCheckerJob job for an unwritable cluster default"
[19:21:10] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002"
[19:21:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[19:21:17] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.17  refs T354435
[19:21:22] <stashbot>	 T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435
[19:22:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "finish cloudelastic1009 private IP migration - bking@cumin2002 - T355617"
[19:22:25] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[19:23:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "finish cloudelastic1009 private IP migration - bking@cumin2002 - T355617"
[19:23:58] <inflatador>	 ebernhardson ^^ any opinion on those "Received cirrusSearchCheckerJob job for an unwritable cluster default" errors?
[19:25:26] <jinxer-wm>	 (SystemdUnitFailed) resolved: (46) prometheus-phpfpm-statustext-textfile.service Failed on mw1350:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:27:17] <wikibugs>	 (CR) Bking: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1290/console" [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[19:30:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P56368 and previous config saved to /var/cache/conftool/dbconfig/20240206-193052-marostegui.json
[19:31:34] <brennen>	 ~186 of those.
[19:35:26] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@718fc41]: Regular analytics weekly train [analytics/refinery@718fc417]
[19:42:04] <wikibugs>	 (PS3) Bking: cloudelastic: Complete cloudelastic1009's migration [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617)
[19:42:22] <wikibugs>	 (CR) Muehlenhoff: "If you want to take a server out of production use for some time before the eventual decom, the insetup::foo roles should be used." [puppet] - https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: Eevans)
[19:43:08] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) (owner: Scott French)
[19:43:11] <wikibugs>	 (CR) Ebernhardson: [C: +1] cloudelastic: Complete cloudelastic1009's migration [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[19:43:39] <wikibugs>	 (CR) Bking: [C: +2] cloudelastic: Complete cloudelastic1009's migration [puppet] - https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[19:45:42] <inflatador>	 dancy quick update re: cloudelastic errors. Based on convo w e-bernhardson they are nothing to worry about...cloudelastic is supposed to be read-only from the normal jobrunner pipeline ATM
[19:45:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T355609)', diff saved to https://phabricator.wikimedia.org/P56370 and previous config saved to /var/cache/conftool/dbconfig/20240206-194558-marostegui.json
[19:46:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[19:46:03] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[19:46:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[19:46:17] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:46:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:46:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T355609)', diff saved to https://phabricator.wikimedia.org/P56371 and previous config saved to /var/cache/conftool/dbconfig/20240206-194639-marostegui.json
[19:47:43] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@718fc41]: Regular analytics weekly train [analytics/refinery@718fc417] (duration: 12m 17s)
[19:49:29] <brennen>	 thx inflatador.
[19:49:39] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@718fc41] (thin): Regular analytics weekly train THIN [analytics/refinery@718fc417]
[19:49:45] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@718fc41] (thin): Regular analytics weekly train THIN [analytics/refinery@718fc417] (duration: 00m 06s)
[19:49:58] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@718fc41] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@718fc417]
[19:50:16] <wikibugs>	 (PS1) Muehlenhoff: Revert "admin: remove ssh key of Connie Chen" [puppet] - https://gerrit.wikimedia.org/r/997952 (https://phabricator.wikimedia.org/T356645)
[19:52:02] <wikibugs>	 (PS2) Jdlrobson: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728)
[19:52:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Revert "admin: remove ssh key of Connie Chen" [puppet] - https://gerrit.wikimedia.org/r/997952 (https://phabricator.wikimedia.org/T356645) (owner: Muehlenhoff)
[19:52:12] <wikibugs>	 (PS1) Jdlrobson: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997874 (https://phabricator.wikimedia.org/T356728)
[19:53:32] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@718fc41] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@718fc417] (duration: 03m 33s)
[19:55:19] <wikibugs>	 (PS1) Muehlenhoff: Revert: admin: remove conniecc1 from groups, set to absent [puppet] - https://gerrit.wikimedia.org/r/997953 (https://phabricator.wikimedia.org/T356645)
[19:55:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T355609)', diff saved to https://phabricator.wikimedia.org/P56372 and previous config saved to /var/cache/conftool/dbconfig/20240206-195532-marostegui.json
[19:55:42] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[19:56:33] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@93fa570]: (no justification provided)
[19:57:02] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@93fa570]: (no justification provided) (duration: 00m 28s)
[19:59:09] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Revert: admin: remove conniecc1 from groups, set to absent [puppet] - https://gerrit.wikimedia.org/r/997953 (https://phabricator.wikimedia.org/T356645) (owner: Muehlenhoff)
[20:04:42] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (MoritzMuehlenhoff) - The SSH key was reinstated, the changes roll out across the next 30 minutes. - The POSIX groups were readded, the changes roll out...
[20:07:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic
[20:07:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic
[20:10:32] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@5f38647]: (no justification provided)
[20:10:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56373 and previous config saved to /var/cache/conftool/dbconfig/20240206-201039-marostegui.json
[20:10:59] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@5f38647]: (no justification provided) (duration: 00m 27s)
[20:21:11] <logmsgbot>	 !log joal@deploy2002 Started deploy [airflow-dags/analytics@09b8dc5]: Regular analytics weekly train [airflow-dags/analytics@09b8dc55]
[20:21:39] <logmsgbot>	 !log joal@deploy2002 Finished deploy [airflow-dags/analytics@09b8dc5]: Regular analytics weekly train [airflow-dags/analytics@09b8dc55] (duration: 00m 28s)
[20:22:25] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Remove memcached cruft from codfw1dev cloudservice nodes [puppet] - https://gerrit.wikimedia.org/r/997554 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott)
[20:25:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56374 and previous config saved to /var/cache/conftool/dbconfig/20240206-202546-marostegui.json
[20:27:40] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1009.eqiad.wmnet
[20:27:45] <wikibugs>	 (PS2) Andrew Bogott: Removed refs to openstack version 'yoga' [puppet] - https://gerrit.wikimedia.org/r/997538
[20:27:55] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=cloudelastic,name=cloudelastic1009.eqiad.wmnet
[20:28:25] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (mpopov) Thanks so much @MoritzMuehlenhoff!!
[20:40:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T355609)', diff saved to https://phabricator.wikimedia.org/P56375 and previous config saved to /var/cache/conftool/dbconfig/20240206-204053-marostegui.json
[20:40:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[20:40:57] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[20:41:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[20:41:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T355609)', diff saved to https://phabricator.wikimedia.org/P56376 and previous config saved to /var/cache/conftool/dbconfig/20240206-204115-marostegui.json
[20:51:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T355609)', diff saved to https://phabricator.wikimedia.org/P56377 and previous config saved to /var/cache/conftool/dbconfig/20240206-205101-marostegui.json
[20:51:09] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[20:59:34] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T2100).
[21:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:57] <cjming>	 Jdlrobson: if you're around, I can deploy your patches
[21:02:36] <wikibugs>	 (PS1) Majavah: WebRequest: Fix default for backwards compat [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997876 (https://phabricator.wikimedia.org/T356800)
[21:05:52] <Jdlrobson>	 present cjming 
[21:06:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P56378 and previous config saved to /var/cache/conftool/dbconfig/20240206-210607-marostegui.json
[21:07:08] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@916bff2]: (no justification provided)
[21:07:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson)
[21:07:37] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@916bff2]: (no justification provided) (duration: 00m 29s)
[21:09:01] <wikibugs>	 SRE, Traffic: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (cmooney) Seems reasonable.  There are some good reasons not to go too far (reducing load both our side and for recursive servers on the internet), but 5 mins seems ok to me.
[21:18:24] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Removed refs to openstack version 'yoga' [puppet] - https://gerrit.wikimedia.org/r/997538 (owner: Andrew Bogott)
[21:21:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P56379 and previous config saved to /var/cache/conftool/dbconfig/20240206-212114-marostegui.json
[21:22:17] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (cmooney) Open→Resolved a:cmooney Closing task, all looks good following change.  Big thanks to @Jhancock....
[21:22:25] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[21:28:41] <wikibugs>	 (Merged) jenkins-bot: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson)
[21:29:08] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:997282|Reduce font size of diff heading (T356728)]]
[21:29:12] <stashbot>	 T356728: Regression: Font size increased on diff pages - https://phabricator.wikimedia.org/T356728
[21:30:37] <logmsgbot>	 !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:997282|Reduce font size of diff heading (T356728)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:30:40] <cjming>	 Jdlrobson: wanna test 1st patch?
[21:30:43] <Jdlrobson>	 yep
[21:30:46] <Jdlrobson>	 wmf16?
[21:30:49] <cjming>	 yes
[21:31:04] <Jdlrobson>	 cjming: yep that did 
[21:31:04] <Jdlrobson>	 it
[21:31:06] <Jdlrobson>	 please sync :)
[21:31:11] <cjming>	 will do
[21:31:14] <logmsgbot>	 !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync
[21:31:16] <Jdlrobson>	 The other one you can also sync - no need to test as it's not live yet.
[21:31:25] <cjming>	 alrighty
[21:31:40] <Jdlrobson>	 Thank you :)
[21:32:14] <wikibugs>	 (CR) Clare Ming: [C: +2] Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997874 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson)
[21:32:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:35:54] <wikibugs>	 SRE, DNS, Foundational Technology Requests, Traffic, Patch-For-Review: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (jhathaway) a:jhathaway
[21:36:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T355609)', diff saved to https://phabricator.wikimedia.org/P56380 and previous config saved to /var/cache/conftool/dbconfig/20240206-213621-marostegui.json
[21:36:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[21:36:30] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[21:36:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[21:37:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) prometheus-phpfpm-statustext-textfile.service Failed on mw1384:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:37:45] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:997282|Reduce font size of diff heading (T356728)]] (duration: 08m 37s)
[21:37:49] <stashbot>	 T356728: Regression: Font size increased on diff pages - https://phabricator.wikimedia.org/T356728
[21:38:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997874 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson)
[21:39:27] <wikibugs>	 SRE, DNS, Foundational Technology Requests, Traffic, Patch-For-Review: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (jhathaway) Open→Resolved @bcampbell I assume this is resolved, please go...
[21:42:26] <jinxer-wm>	 (SystemdUnitFailed) resolved: (34) prometheus-phpfpm-statustext-textfile.service Failed on mw1364:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:47:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[21:47:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[21:52:25] <wikibugs>	 (Merged) jenkins-bot: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997874 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson)
[21:52:49] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:997874|Reduce font size of diff heading (T356728)]]
[21:52:55] <stashbot>	 T356728: Regression: Font size increased on diff pages - https://phabricator.wikimedia.org/T356728
[21:54:16] <wikibugs>	 (PS1) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995)
[21:54:17] <logmsgbot>	 !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:997874|Reduce font size of diff heading (T356728)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:54:45] <logmsgbot>	 !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync
[21:55:26] <wikibugs>	 (CR) FNegri: [C: +1] "We should probably investigate what's broken, but I'm ok with merging if this fixes the issue. Please create a task to track this in Phab." [puppet] - https://gerrit.wikimedia.org/r/994250 (owner: Majavah)
[21:56:44] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott)
[21:57:41] <jinxer-wm>	 (SystemdUnitFailed) firing: (33) prometheus-phpfpm-statustext-textfile.service Failed on mw1364:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:58:20] <wikibugs>	 (CR) CI reject: [V: -1] OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott)
[21:59:31] <wikibugs>	 SRE, DNS, Foundational Technology Requests, Traffic, Patch-For-Review: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (bcampbell) @jhathaway Sorry for not closing the loop on this one. It is resolved...
[22:00:16] <Jdlrobson>	 thank you cjming 
[22:00:48] <jinxer-wm>	 (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:00:53] <cjming>	 yw! wmf17 patch live soon
[22:01:14] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:997874|Reduce font size of diff heading (T356728)]] (duration: 08m 24s)
[22:01:30] <stashbot>	 T356728: Regression: Font size increased on diff pages - https://phabricator.wikimedia.org/T356728
[22:01:52] <cjming>	 !log end of UTC late backport window
[22:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:04] <wikibugs>	 (PS2) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995)
[22:02:40] <jinxer-wm>	 (SystemdUnitFailed) firing: (32) prometheus-phpfpm-statustext-textfile.service Failed on mw1370:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:02:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (33) prometheus-phpfpm-statustext-textfile.service Failed on mw1370:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:06:09] <wikibugs>	 (CR) CI reject: [V: -1] OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott)
[22:07:10] <wikibugs>	 (PS3) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995)
[22:07:41] <jinxer-wm>	 (SystemdUnitFailed) resolved: (33) prometheus-phpfpm-statustext-textfile.service Failed on mw1370:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:08:25] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott)
[22:14:14] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:17:41] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:27:05] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@11e5c60]: (no justification provided)
[22:27:33] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@11e5c60]: (no justification provided) (duration: 00m 28s)
[22:35:53] <wikibugs>	 (PS1) Jforrester: Fix PermissionException being logged [extensions/Flow] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997877 (https://phabricator.wikimedia.org/T356223)
[22:36:08] <wikibugs>	 (PS1) Jforrester: Fix PermissionException being logged [extensions/Flow] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997878 (https://phabricator.wikimedia.org/T356223)
[22:42:54] <brennen>	 jouncebot nowandnext
[22:42:54] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 17 minute(s)
[22:42:54] <jouncebot>	 In 8 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0700)
[22:43:41] <wikibugs>	 (PS1) Jforrester: Set the memory limit in bytes. [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997879 (https://phabricator.wikimedia.org/T356780)
[22:47:15] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997876 (https://phabricator.wikimedia.org/T356800) (owner: Majavah)
[23:08:41] <wikibugs>	 (Merged) jenkins-bot: WebRequest: Fix default for backwards compat [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997876 (https://phabricator.wikimedia.org/T356800) (owner: Majavah)
[23:09:05] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:997876|WebRequest: Fix default for backwards compat (T356800)]]
[23:09:09] <stashbot>	 T356800: ArgumentCountError: Too few arguments to function MediaWiki\Request\WebRequest::getRequestPathSuffix() - https://phabricator.wikimedia.org/T356800
[23:10:36] <logmsgbot>	 !log brennen@deploy2002 taavi and brennen: Backport for [[gerrit:997876|WebRequest: Fix default for backwards compat (T356800)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:11:21] <brennen>	 appears to fix officewiki image glitches.
[23:11:41] <logmsgbot>	 !log brennen@deploy2002 taavi and brennen: Continuing with sync
[23:12:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:17:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) prometheus-phpfpm-statustext-textfile.service Failed on mw1355:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:18:07] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:997876|WebRequest: Fix default for backwards compat (T356800)]] (duration: 09m 02s)
[23:18:11] <stashbot>	 T356800: ArgumentCountError: Too few arguments to function MediaWiki\Request\WebRequest::getRequestPathSuffix() - https://phabricator.wikimedia.org/T356800
[23:22:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (36) prometheus-phpfpm-statustext-textfile.service Failed on mw1352:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:24:10] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@034ea4b]: (no justification provided)
[23:24:38] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@034ea4b]: (no justification provided) (duration: 00m 27s)