[00:26:22] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[1028-1033].eqiad.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [00:26:26] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [00:29:07] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2021-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [00:38:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/997500 [00:38:41] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/997500 (owner: 10TrainBranchBot) [00:41:12] (03PS1) 10Eevans: sessionstore: remove EOL hosts [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) [00:42:32] (03CR) 10CI reject: [V: 04-1] sessionstore: remove EOL hosts [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: 10Eevans) [00:45:06] (03PS2) 10Eevans: sessionstore: remove EOL hosts [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) [01:04:29] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/997500 (owner: 10TrainBranchBot) [01:08:49] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T356726 (10phaultfinder) [01:18:52] !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase[2021-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [01:18:57] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [01:19:49] !log 
eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2026-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [01:28:47] (03CR) 10Andrew Bogott: [C: 03+2] Allow pdns to query designate-mdns on private interfaces [puppet] - 10https://gerrit.wikimedia.org/r/997597 (https://phabricator.wikimedia.org/T350995) (owner: 10Andrew Bogott) [01:36:17] (03PS1) 10Jdlrobson: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728) [01:49:19] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:55:23] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:57:09] (03CR) 10CI reject: [V: 04-1] Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728) (owner: 10Jdlrobson) [02:00:23] !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase[2026-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [02:00:36] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [02:03:19] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2030-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [02:34:53] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data 
above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [02:39:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (10Jhancock.wm) [02:50:01] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0300) [03:00:24] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[2030-2035].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [03:00:39] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [03:07:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.17 [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997501 (https://phabricator.wikimedia.org/T354435) [03:07:45] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.17 [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997501 (https://phabricator.wikimedia.org/T354435) (owner: 10TrainBranchBot) [03:09:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:26:38] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.17 [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997501 (https://phabricator.wikimedia.org/T354435) (owner: 10TrainBranchBot) [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0400) [04:02:12] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.14 (duration: 02m 07s) [04:03:29] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997631 (https://phabricator.wikimedia.org/T354435) [04:03:31] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997631 (https://phabricator.wikimedia.org/T354435) (owner: 10TrainBranchBot) [04:04:15] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997631 (https://phabricator.wikimedia.org/T354435) (owner: 10TrainBranchBot) [04:04:45] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.17 refs T354435 [04:04:52] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [04:10:15] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [04:14:47] RECOVERY - CirrusSearch more_like codfw 
95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [04:55:47] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.17 refs T354435 (duration: 51m 02s) [04:55:51] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [05:11:46] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:58:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1029 T351916', diff saved to https://phabricator.wikimedia.org/P56283 and previous config saved to /var/cache/conftool/dbconfig/20240206-055835-root.json [05:58:40] T351916: Migrate es1 to Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351916 [05:59:13] (03PS1) 10Marostegui: es1029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/997636 (https://phabricator.wikimedia.org/T351916) [06:00:39] (03CR) 10Marostegui: [C: 03+2] es1029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/997636 (https://phabricator.wikimedia.org/T351916) (owner: 10Marostegui) [06:01:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1029.eqiad.wmnet with OS bookworm [06:02:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [06:02:33] 
PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [06:02:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [06:05:37] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [06:06:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:06:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:09:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186', diff saved to https://phabricator.wikimedia.org/P56284 and previous config saved to /var/cache/conftool/dbconfig/20240206-060942-root.json [06:10:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [06:10:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [06:10:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:11:10] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:11:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T355609)', diff saved to https://phabricator.wikimedia.org/P56285 and previous config saved to /var/cache/conftool/dbconfig/20240206-061116-marostegui.json [06:11:20] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:17:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T355609)', diff saved to https://phabricator.wikimedia.org/P56286 and previous config saved to /var/cache/conftool/dbconfig/20240206-061709-marostegui.json [06:23:53] PROBLEM - MariaDB Replica Lag: s1 on db1186 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 905.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:27:37] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on db1186 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1094.52 seconds Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:32:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P56287 and previous config saved to /var/cache/conftool/dbconfig/20240206-063215-marostegui.json [06:37:47] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1029.eqiad.wmnet with OS bookworm [06:38:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1029.eqiad.wmnet with OS bullseye [06:47:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to 
https://phabricator.wikimedia.org/P56288 and previous config saved to /var/cache/conftool/dbconfig/20240206-064722-marostegui.json [06:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0700) [07:00:05] kormat, marostegui, Amir1, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0700). [07:02:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T355609)', diff saved to https://phabricator.wikimedia.org/P56289 and previous config saved to /var/cache/conftool/dbconfig/20240206-070228-marostegui.json [07:02:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance [07:02:39] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:02:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance [07:02:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T355609)', diff saved to https://phabricator.wikimedia.org/P56290 and previous config saved to /var/cache/conftool/dbconfig/20240206-070251-marostegui.json [07:07:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T355609)', diff saved to https://phabricator.wikimedia.org/P56291 and previous config saved to /var/cache/conftool/dbconfig/20240206-070708-marostegui.json [07:22:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to 
https://phabricator.wikimedia.org/P56292 and previous config saved to /var/cache/conftool/dbconfig/20240206-072215-marostegui.json [07:25:15] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1029.eqiad.wmnet with OS bullseye [07:31:29] (03PS2) 10Hoo man: Add wgVirtualDomainsMapping for Cognate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526) [07:37:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P56293 and previous config saved to /var/cache/conftool/dbconfig/20240206-073721-marostegui.json [07:37:31] (03CR) 10Hoo man: Add wgVirtualDomainsMapping for Cognate (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [07:52:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T355609)', diff saved to https://phabricator.wikimedia.org/P56294 and previous config saved to /var/cache/conftool/dbconfig/20240206-075228-marostegui.json [07:52:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance [07:52:33] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:52:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance [07:52:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T355609)', diff saved to https://phabricator.wikimedia.org/P56295 and previous config saved to /var/cache/conftool/dbconfig/20240206-075251-marostegui.json [07:56:39] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [07:56:54] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [07:57:14] !log kharlan@deploy2002 helmfile [eqiad] START 
helmfile.d/services/ipoid: apply [07:57:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T355609)', diff saved to https://phabricator.wikimedia.org/P56296 and previous config saved to /var/cache/conftool/dbconfig/20240206-075733-marostegui.json [07:57:37] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:57:47] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [08:00:04] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T0800). [08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:00:27] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [08:00:54] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [08:06:42] !log hoo@deploy2002 backport Cancelled [08:07:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hoo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [08:07:04] (03Abandoned) 10Filippo Giunchedi: Deprecate nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/924901 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [08:07:42] (03Merged) 10jenkins-bot: Add wgVirtualDomainsMapping for Cognate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [08:08:46] !log hoo@deploy2002 Started scap: Backport for [[gerrit:994922|Add wgVirtualDomainsMapping for Cognate (T348526)]] [08:08:50] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526 [08:09:54] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 
10Slyngshede) [08:10:35] !log hoo@deploy2002 hoo: Backport for [[gerrit:994922|Add wgVirtualDomainsMapping for Cognate (T348526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:10:48] (03CR) 10Filippo Giunchedi: "I'm +1 on the idea, not voting yet pending https://gerrit.wikimedia.org/r/c/operations/alerts/+/997253?usp=dashboard" [puppet] - 10https://gerrit.wikimedia.org/r/994735 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:11:13] (03PS2) 10Arnaudb: mariadb: will test converting instances [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) [08:11:16] !log hoo@deploy2002 hoo: Continuing with sync [08:11:22] (03CR) 10Arnaudb: mariadb: will test converting instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [08:12:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P56297 and previous config saved to /var/cache/conftool/dbconfig/20240206-081239-marostegui.json [08:14:39] (03CR) 10Filippo Giunchedi: "I believe this can be abandoned at this point ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/990166 (https://phabricator.wikimedia.org/T354904) (owner: 10Cwhite) [08:16:33] (03CR) 10Muehlenhoff: systemd::unit: clean up ownership file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [08:17:38] !log hoo@deploy2002 Finished scap: Backport for [[gerrit:994922|Add wgVirtualDomainsMapping for Cognate (T348526)]] (duration: 08m 51s) [08:17:42] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526 [08:20:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove now obsolete scap config for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/997483 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [08:21:02] RECOVERY - MariaDB Replica Lag: s1 on db1186 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:22:29] Hi, is deployment window still running? [08:23:25] *UTC morning backport [08:27:12] Amir1, urbanecm? [08:27:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P56298 and previous config saved to /var/cache/conftool/dbconfig/20240206-082746-marostegui.json [08:27:48] there was no patch in it AFAICS [08:28:20] No, namespaceDupes has to be run on srwiki, and I'm hoping that we can finally do it. [08:28:51] I didn't want to add it in the calendar because I thought that UTC backport window is "marked as finished". [08:28:58] There is a task. https://phabricator.wikimedia.org/T350431 [08:30:11] This patch is live so I'm guessing that we won't have surprises. 
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/995242 [08:32:37] !log pruning unneeded openjdk-17-jre-headless packages on ml-cache* hosts [08:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:25] (03PS18) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [08:41:18] (03PS2) 10Slyngshede: D:prometheus::blackbox::check::tcp allow specifying runbook. [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) [08:41:40] (03CR) 10Slyngshede: D:prometheus::blackbox::check::tcp allow specifying runbook. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:42:06] (03CR) 10Slyngshede: [C: 03+2] SystemdUnitFailed: Increase the severity of a failed unit to critical. [alerts] - 10https://gerrit.wikimedia.org/r/997253 (owner: 10Slyngshede) [08:42:39] !log Increase severity of failed systemd units when alerting from AlertManager [08:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T355609)', diff saved to https://phabricator.wikimedia.org/P56299 and previous config saved to /var/cache/conftool/dbconfig/20240206-084253-marostegui.json [08:42:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1210.eqiad.wmnet with reason: Maintenance [08:42:57] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:43:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1210.eqiad.wmnet with reason: Maintenance [08:43:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T355609)', diff saved to https://phabricator.wikimedia.org/P56300 and previous config saved to 
/var/cache/conftool/dbconfig/20240206-084315-marostegui.json [08:43:43] (03Merged) 10jenkins-bot: SystemdUnitFailed: Increase the severity of a failed unit to critical. [alerts] - 10https://gerrit.wikimedia.org/r/997253 (owner: 10Slyngshede) [08:46:24] (03PS10) 10Brouberol: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis) [08:47:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:25] (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:34] !log pruning unneeded openjdk-17-jre-headless packages on aqs* hosts [08:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T355609)', diff saved to https://phabricator.wikimedia.org/P56301 and previous config saved to /var/cache/conftool/dbconfig/20240206-084858-marostegui.json [08:49:03] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:50:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [08:52:08] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:25] (SystemdUnitFailed) resolved: prometheus-phpfpm-statustext-textfile.service Failed on mw2279:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:53:36] (03CR) 10Vgutierrez: fifo-log-demux: Decouple service from nginx/ats (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [09:01:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 1%: After schema change', diff saved to https://phabricator.wikimedia.org/P56302 and previous config saved to /var/cache/conftool/dbconfig/20240206-090102-root.json [09:01:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "nit inline, rest LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:01:54] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host build2001.codfw.wmnet [09:02:25] (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:42] (03PS1) 10Filippo Giunchedi: jaeger: route jaeger-query to oauth2-proxy port [deployment-charts] - 10https://gerrit.wikimedia.org/r/997789 (https://phabricator.wikimedia.org/T320555) [09:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P56303 and previous config saved to /var/cache/conftool/dbconfig/20240206-090405-marostegui.json [09:07:48] (03PS1) 10Arnaudb: mariadb: add db1235 to production [puppet] - 10https://gerrit.wikimedia.org/r/997503 (https://phabricator.wikimedia.org/T344036) [09:08:16] (03PS3) 10Slyngshede: D:prometheus::blackbox::check::tcp allow specifying runbook. 
[puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) [09:12:16] (03PS1) 10Arnaudb: mariadb: toggle notifications for db1235 [puppet] - 10https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036) [09:12:42] (03Abandoned) 10Arnaudb: mariadb: add db1235 to production [puppet] - 10https://gerrit.wikimedia.org/r/997503 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:13:43] (03PS1) 10Muehlenhoff: Remove obsolete scap config for Netbox/Homer [puppet] - 10https://gerrit.wikimedia.org/r/997790 [09:16:00] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/997790 (owner: 10Muehlenhoff) [09:16:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P56304 and previous config saved to /var/cache/conftool/dbconfig/20240206-091607-root.json [09:18:09] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete scap config for Netbox/Homer [puppet] - 10https://gerrit.wikimedia.org/r/997790 (owner: 10Muehlenhoff) [09:19:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P56305 and previous config saved to /var/cache/conftool/dbconfig/20240206-091911-marostegui.json [09:20:51] (03CR) 10Marostegui: "Green on icinga?" [puppet] - 10https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:21:09] (03CR) 10Arnaudb: "yep!" 
[puppet] - 10https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:21:14] (03CR) 10Marostegui: [C: 03+1] mariadb: toggle notifications for db1235 [puppet] - 10https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:21:29] (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notifications for db1235 [puppet] - 10https://gerrit.wikimedia.org/r/997504 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:22:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 10%: 5', diff saved to https://phabricator.wikimedia.org/P56306 and previous config saved to /var/cache/conftool/dbconfig/20240206-092257-arnaudb.json [09:25:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/997484 (https://phabricator.wikimedia.org/T356409) (owner: 10Slyngshede) [09:26:22] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:26:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:30:03] (03PS2) 10Filippo Giunchedi: jaeger: route trace.w.o to jaeger-query [deployment-charts] - 10https://gerrit.wikimedia.org/r/997789 (https://phabricator.wikimedia.org/T320555) [09:30:25] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Make it clear what password is being reset [software/bitu] - 10https://gerrit.wikimedia.org/r/997484 (https://phabricator.wikimedia.org/T356409) (owner: 10Slyngshede) [09:30:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:30:54] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:31:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P56307 and previous config saved to /var/cache/conftool/dbconfig/20240206-093112-root.json [09:33:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [09:34:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T355609)', diff saved to https://phabricator.wikimedia.org/P56308 and previous config saved to /var/cache/conftool/dbconfig/20240206-093418-marostegui.json [09:34:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1213.eqiad.wmnet with reason: Maintenance [09:34:22] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:34:33] (03PS1) 10Brouberol: Allow pods in the dse k8s cluster to reach an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/997792 (https://phabricator.wikimedia.org/T356623) [09:34:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1213.eqiad.wmnet with reason: Maintenance [09:34:35] (03PS1) 10Brouberol: Allow pods in the dse k8s cluster to reach an-druid [puppet] - 10https://gerrit.wikimedia.org/r/997793 (https://phabricator.wikimedia.org/T356623) [09:34:37] (03PS1) 10Brouberol: Allow pods in the dse k8s cluster to reach public-druid [puppet] - 10https://gerrit.wikimedia.org/r/997794 (https://phabricator.wikimedia.org/T356623) [09:34:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56309 and previous 
config saved to /var/cache/conftool/dbconfig/20240206-093440-marostegui.json [09:37:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet [09:38:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 20%: 5', diff saved to https://phabricator.wikimedia.org/P56310 and previous config saved to /var/cache/conftool/dbconfig/20240206-093803-arnaudb.json [09:39:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56311 and previous config saved to /var/cache/conftool/dbconfig/20240206-093925-marostegui.json [09:39:29] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:39:52] (03PS11) 10Brouberol: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis) [09:43:28] (03PS19) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [09:45:04] (03PS20) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [09:46:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P56312 and previous config saved to /var/cache/conftool/dbconfig/20240206-094617-root.json [09:47:26] !log roll restart all pods in wikikube@eqiad [09:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:53:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 30%: 5', diff saved to https://phabricator.wikimedia.org/P56313 and previous config saved to /var/cache/conftool/dbconfig/20240206-095308-arnaudb.json [09:54:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P56314 and previous config saved to /var/cache/conftool/dbconfig/20240206-095432-marostegui.json [09:56:32] !log installing mariadb-10.5 security/bugfix updates from Bullseye point release (as packaged by Debian, unrelated to wmf-mariadb packages) [09:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P56315 and previous config saved to /var/cache/conftool/dbconfig/20240206-100123-root.json [10:01:27] (03CR) 10Majavah: [C: 03+2] wikireplicas: update-views: always run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/989130 (https://phabricator.wikimedia.org/T297026) (owner: 10Majavah) [10:01:30] (03CR) 10Majavah: [C: 03+2] libraryupgrader: migrate repo to gitlab [puppet] - 10https://gerrit.wikimedia.org/r/997547 (https://phabricator.wikimedia.org/T341417) (owner: 10Majavah) [10:03:47] (03PS2) 10Majavah: systemd: timer_service: Move ConditionPathExists to correct section [puppet] - 10https://gerrit.wikimedia.org/r/992888 [10:06:10] (03Merged) 10jenkins-bot: wikireplicas: update-views: always run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/989130 (https://phabricator.wikimedia.org/T297026) (owner: 10Majavah) [10:06:12] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1281/console" [puppet] - 10https://gerrit.wikimedia.org/r/992888 (owner: 10Majavah) [10:07:06] (03CR) 10Majavah: [V: 03+1 C: 03+2] systemd: timer_service: Move ConditionPathExists to correct section [puppet] - 10https://gerrit.wikimedia.org/r/992888 (owner: 10Majavah) [10:07:58] (03PS1) 10Slyngshede: P:docker::builder clean docker image cache regularly. [puppet] - 10https://gerrit.wikimedia.org/r/997796 [10:08:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 40%: 5', diff saved to https://phabricator.wikimedia.org/P56316 and previous config saved to /var/cache/conftool/dbconfig/20240206-100813-arnaudb.json [10:08:32] (03CR) 10Slyngshede: D:prometheus::blackbox::check::tcp allow specifying runbook. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:09:22] (03CR) 10Slyngshede: [C: 03+2] D:prometheus::blackbox::check::tcp allow specifying runbook. 
[puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:09:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P56317 and previous config saved to /var/cache/conftool/dbconfig/20240206-100938-marostegui.json [10:16:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P56319 and previous config saved to /var/cache/conftool/dbconfig/20240206-101628-root.json [10:20:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2009.codfw.wmnet [10:20:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1009.eqiad.wmnet [10:22:19] !log roll restart all pods in wikikube@codfw, wikikube@staging-codfw, wikikube@staging-eqiad [10:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 50%: 5', diff saved to https://phabricator.wikimedia.org/P56320 and previous config saved to /var/cache/conftool/dbconfig/20240206-102318-arnaudb.json [10:24:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56321 and previous config saved to /var/cache/conftool/dbconfig/20240206-102445-marostegui.json [10:24:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance [10:24:49] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:25:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance [10:29:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1230.eqiad.wmnet with reason: 
Maintenance [10:29:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1230.eqiad.wmnet with reason: Maintenance [10:29:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T355609)', diff saved to https://phabricator.wikimedia.org/P56322 and previous config saved to /var/cache/conftool/dbconfig/20240206-102932-marostegui.json [10:31:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P56323 and previous config saved to /var/cache/conftool/dbconfig/20240206-103133-root.json [10:33:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T355609)', diff saved to https://phabricator.wikimedia.org/P56324 and previous config saved to /var/cache/conftool/dbconfig/20240206-103341-marostegui.json [10:33:45] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:35:51] (03PS12) 10Brouberol: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis) [10:38:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 75%: 5', diff saved to https://phabricator.wikimedia.org/P56325 and previous config saved to /var/cache/conftool/dbconfig/20240206-103823-arnaudb.json [10:40:40] (03PS1) 10Btullis: Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040) [10:41:51] (03CR) 10CI reject: [V: 04-1] Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [10:45:20] (03PS1) 10Btullis: Configure reuse-parts for the analytics webserver [puppet] - 10https://gerrit.wikimedia.org/r/997798 (https://phabricator.wikimedia.org/T349398) [10:45:37] !log jgiannelos@deploy2002 Started deploy 
[kartotherian/deploy@3325683]: (no justification provided) [10:45:59] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683]: (no justification provided) (duration: 00m 22s) [10:48:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P56326 and previous config saved to /var/cache/conftool/dbconfig/20240206-104848-marostegui.json [10:49:29] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:49:52] PROBLEM - kartotherian endpoints health on maps1009 is CRITICAL: /{src}/info.json (Get service info for osm-intl) is CRITICAL: Test Get service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [10:50:07] (03CR) 10Btullis: [C: 03+2] Configure reuse-parts for the analytics webserver [puppet] - 10https://gerrit.wikimedia.org/r/997798 (https://phabricator.wikimedia.org/T349398) (owner: 10Btullis) [10:53:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 100%: 5', diff saved to https://phabricator.wikimedia.org/P56327 and previous config saved to /var/cache/conftool/dbconfig/20240206-105328-arnaudb.json [10:57:48] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [10:57:53] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1100) [11:02:09] (03PS2) 10Btullis: Bring two new stat servers into service [puppet] - 
10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040) [11:03:00] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-web1001.eqiad.wmnet with OS bullseye [11:03:21] (03CR) 10CI reject: [V: 04-1] Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [11:03:49] btullis: FYI there is a possibility that the reimage gets stuck in debian-installer failing to get the proper netmask, we got some failures yesterday and I'm looking at them, not sure yet if it affects all hosts [11:03:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P56328 and previous config saved to /var/cache/conftool/dbconfig/20240206-110354-marostegui.json [11:04:01] we did have a successful reimage yesterday too, so not sure [11:04:12] you can keep an eye on the mgmt console to see the progress [11:04:20] jouncebot: nowandnext [11:04:20] For the next 0 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1100) [11:04:20] In 1 hour(s) and 55 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1300) [11:05:12] volans: OK, thanks. I'll be on the lookout and report back. It's using reuse-parts-test, so I expect it to wait in the installer at the partman screen, but I'll let you know if it doesn't get that far.
[11:05:41] thx [11:07:35] (03PS3) 10Btullis: Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040) [11:10:46] (03PS1) 10Brouberol: superset: configure extra TLS SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/997799 (https://phabricator.wikimedia.org/T356482) [11:12:26] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [11:12:31] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s) [11:13:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host stat1010.eqiad.wmnet [11:13:39] volans: yes I think it failed with a red screen in the installer, having failed to download the preseed file. I went back and selected 'configure network' again and it had the correct values displayed. [11:14:33] I can re-run the cookbook if it would be helpful to you, or I'm happy to continue. It has now downloaded the preseed successfully. [11:15:01] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host stat1010.eqiad.wmnet [11:16:08] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:32] btullis: that's super weird, no worries, I've plenty of hosts to play with [11:17:19] Ack, I'll probably continue then. 
[11:17:25] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:36] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T355609)', diff saved to https://phabricator.wikimedia.org/P56329 and previous config saved to /var/cache/conftool/dbconfig/20240206-111901-marostegui.json [11:19:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1244.eqiad.wmnet with reason: Maintenance [11:19:05] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:19:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1244.eqiad.wmnet with reason: Maintenance [11:19:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1244:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56330 and previous config saved to /var/cache/conftool/dbconfig/20240206-111923-marostegui.json [11:19:54] (03PS2) 10Ladsgroup: Switch the pagelinks default to add read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997420 (https://phabricator.wikimedia.org/T351237) [11:19:57] (03CR) 10Ladsgroup: [C: 03+2] Switch the pagelinks default to add read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997420 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [11:20:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997420 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [11:20:39] (03Merged) 
10jenkins-bot: Switch the pagelinks default to add read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997420 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [11:20:59] marostegui: FYI, ^ most wikis are going read new on pagelinks [11:21:04] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:997420|Switch the pagelinks default to add read new (T351237)]] [11:21:10] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [11:22:15] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [11:22:19] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 04s) [11:22:37] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:997420|Switch the pagelinks default to add read new (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:25:09] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:25:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56331 and previous config saved to /var/cache/conftool/dbconfig/20240206-112514-marostegui.json [11:25:18] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:27:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mw2374:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:28:48] (03CR) 10Majavah: "netbox is using `service::uwsgi` which defaults to `deployment => 'scap3'` (which adds a `scap::target`), does that need updating?" 
[puppet] - 10https://gerrit.wikimedia.org/r/997790 (owner: 10Muehlenhoff) [11:30:17] !log volans@cumin1002 START - Cookbook sre.hosts.dhcp for host mw1408.eqiad.wmnet [11:31:43] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:997420|Switch the pagelinks default to add read new (T351237)]] (duration: 10m 38s) [11:31:47] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [11:32:25] (SystemdUnitFailed) firing: (22) prometheus-phpfpm-statustext-textfile.service Failed on mw1353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:34:44] (03PS1) 10Muehlenhoff: Fix matching block [puppet] - 10https://gerrit.wikimedia.org/r/997800 [11:37:06] (03PS1) 10Filippo Giunchedi: profile: remove Icinga-based systemd unit failed check [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) [11:37:25] (SystemdUnitFailed) resolved: (36) prometheus-phpfpm-statustext-textfile.service Failed on mw1353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:02] (03PS1) 10Filippo Giunchedi: profile: remove absented statsd hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862) [11:39:28] (03CR) 10Filippo Giunchedi: "Doesn't have to happen immediately" [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [11:39:31] (03PS1) 10Volans: installserver: fix typo in preseed [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709) [11:40:13] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-web1001.eqiad.wmnet 
with reason: host reimage [11:40:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3315', diff saved to https://phabricator.wikimedia.org/P56332 and previous config saved to /var/cache/conftool/dbconfig/20240206-114020-marostegui.json [11:40:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709) (owner: 10Volans) [11:40:51] (03CR) 10Stevemunene: [C: 03+2] hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/993742 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene) [11:42:20] (03CR) 10Stevemunene: [C: 03+2] hdfs: Assign the right role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/993743 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene) [11:43:11] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-web1001.eqiad.wmnet with reason: host reimage [11:43:32] (03Abandoned) 10Muehlenhoff: Fix matching block [puppet] - 10https://gerrit.wikimedia.org/r/997800 (owner: 10Muehlenhoff) [11:44:13] (03CR) 10CI reject: [V: 04-1] installserver: fix typo in preseed [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709) (owner: 10Volans) [11:44:41] (03PS2) 10Volans: installserver: fix typo in preseed [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709) [11:45:04] (03PS1) 10Filippo Giunchedi: graphite: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831) [11:45:06] (03PS1) 10Filippo Giunchedi: cache: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831) [11:46:10] !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host mw1408.eqiad.wmnet [11:49:47] !log btullis@cumin1002 
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-web1001.eqiad.wmnet with OS bullseye [11:49:59] (03CR) 10Volans: [C: 03+2] installserver: fix typo in preseed [puppet] - 10https://gerrit.wikimedia.org/r/997804 (https://phabricator.wikimedia.org/T356709) (owner: 10Volans) [11:52:48] (03CR) 10Vgutierrez: [C: 03+1] cache: remove nrpe::monitor_systemd_unit_state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:53:51] (03CR) 10Btullis: [C: 03+1] superset: configure extra TLS SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/997799 (https://phabricator.wikimedia.org/T356482) (owner: 10Brouberol) [11:55:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3315', diff saved to https://phabricator.wikimedia.org/P56334 and previous config saved to /var/cache/conftool/dbconfig/20240206-115527-marostegui.json [11:58:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1029.eqiad.wmnet with OS bookworm [11:59:18] PROBLEM - Check systemd state on an-worker1164 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:08] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:20] RECOVERY - Check systemd state on an-worker1164 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:24] PROBLEM - Check systemd state on an-worker1168 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:41] !log volans@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [12:05:38] 
RECOVERY - Check systemd state on an-worker1168 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:34] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56335 and previous config saved to /var/cache/conftool/dbconfig/20240206-121034-marostegui.json [12:10:36] ok it seems the reimage issues have been fixed, if you encounter new issues please let us know (context in T356709 ) [12:10:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance [12:10:39] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:10:39] T356709: Debian installer waits for input for network config during host reimage - https://phabricator.wikimedia.org/T356709 [12:10:48] volans: yeah, my reimage is working fine [12:10:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance [12:10:51] Thanks [12:10:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1029.eqiad.wmnet with reason: host reimage [12:10:57] nice [12:11:33] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [12:12:12] (03CR) 10Btullis: "Looks great. Couple of questions inline, but looks ready to go. At least for this iteration." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [12:12:19] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [12:13:33] (03CR) 10Clément Goubert: [C: 03+1] P:httpbb: migrate tests from cumin1001 to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [12:13:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1029.eqiad.wmnet with reason: host reimage [12:14:01] (03CR) 10Clément Goubert: [C: 03+1] P:httpbb: clean up after move from cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/995109 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [12:14:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [12:14:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [12:15:44] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:34] PROBLEM - Check systemd state on an-worker1175 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:15] (03CR) 10Btullis: Add a deployment chart for Superset (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [12:17:20] PROBLEM - Check systemd state on an-worker1167 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:58] PROBLEM - Check systemd state on an-worker1160 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:00] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1386.eqiad.wmnet with OS bullseye [12:18:20] PROBLEM - Check systemd state on an-worker1173 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:36] RECOVERY - Check systemd state on an-worker1175 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:40] !log volans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:19:18] PROBLEM - Check systemd state on an-worker1171 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:25] !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:21:30] RECOVERY - Check systemd state on an-worker1171 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:32] RECOVERY - Check systemd state on an-worker1173 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:28] PROBLEM - Hadoop DataNode on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [12:22:36] RECOVERY - Check systemd state on an-worker1167 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:28] RECOVERY - Hadoop DataNode on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [12:26:05] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1388.eqiad.wmnet with OS bullseye [12:27:13] (03PS1) 10Marostegui: Revert "es1029: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/997778 [12:27:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/997506 (https://phabricator.wikimedia.org/T355907) (owner: 10Slyngshede) [12:28:36] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1390.eqiad.wmnet with OS bullseye [12:29:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1029.eqiad.wmnet with OS bookworm [12:31:11] (03PS1) 10Lucas Werkmeister: Load Filepage.css when previewing File pages [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) [12:31:37] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1386.eqiad.wmnet with reason: host reimage [12:32:06] PROBLEM - Check systemd state on an-worker1169 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:11] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:34:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1386.eqiad.wmnet with reason: host reimage [12:34:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.066 second 
response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:36:15] PROBLEM - Hadoop DataNode on an-worker1169 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [12:37:39] !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [12:39:36] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1388.eqiad.wmnet with reason: host reimage [12:39:45] (03PS3) 10Arnaudb: mariadb: will test converting instances [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) [12:40:21] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1392.eqiad.wmnet with OS bullseye [12:40:28] (03PS1) 10Slyngshede: Add gitreview configuration [software/bitu] - 10https://gerrit.wikimedia.org/r/997809 (https://phabricator.wikimedia.org/T355180) [12:40:59] PROBLEM - Check systemd state on an-worker1170 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:41:16] (03CR) 10Marostegui: [C: 03+1] mariadb: will test converting instances [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:41:52] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1394.eqiad.wmnet with OS bullseye [12:42:08] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1390.eqiad.wmnet with reason: host reimage [12:42:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1388.eqiad.wmnet with reason: host reimage [12:42:49] RECOVERY - Hadoop DataNode on an-worker1169 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [12:44:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1390.eqiad.wmnet with reason: host reimage [12:45:04] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [12:45:04] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1396.eqiad.wmnet with OS bullseye [12:45:09] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s) [12:46:30] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1408.eqiad.wmnet with OS bullseye [12:46:39] RECOVERY - Check systemd state on an-worker1169 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:40] (03CR) 10Arnaudb: [C: 03+2] mariadb: will test converting instances [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:48:45] RECOVERY - Check systemd state on an-worker1160 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:57] RECOVERY - Check systemd state on an-worker1170 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:17] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2317.codfw.wmnet with OS bullseye [12:50:13] (03PS1) 10Btullis: Fix the reuse-analytics-raid1-2dev partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/997810 (https://phabricator.wikimedia.org/T349398) [12:50:56] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [12:51:00] !log jgiannelos@deploy2002 deploy aborted: (no justification 
provided) (duration: 00m 04s) [12:51:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1386.eqiad.wmnet with OS bullseye [12:51:55] (03CR) 10Marostegui: [C: 03+2] Revert "es1029: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/997778 (owner: 10Marostegui) [12:52:12] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2318.codfw.wmnet with OS bullseye [12:52:41] (03CR) 10Btullis: [C: 03+2] Fix the reuse-analytics-raid1-2dev partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/997810 (https://phabricator.wikimedia.org/T349398) (owner: 10Btullis) [12:53:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:53:57] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1392.eqiad.wmnet with reason: host reimage [12:54:43] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2319.codfw.wmnet with OS bullseye [12:54:50] !log installing openjdk-11 security updates [12:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:57] (03PS1) 10Slyngshede: Provide context for account creation. [software/bitu] - 10https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) [12:55:38] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1394.eqiad.wmnet with reason: host reimage [12:56:08] (03CR) 10Slyngshede: "Adding Bryan as a reviewer as well, for input on logo swap." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: 10Slyngshede) [12:56:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1392.eqiad.wmnet with reason: host reimage [12:57:10] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [12:57:15] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s) [12:58:17] !log aokoth@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM vrts1002.eqiad.wmnet [12:59:00] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1396.eqiad.wmnet with reason: host reimage [12:59:30] !log Pruning images older than 45 days on build2001: docker image prune -a --filter "until=1080h"/25 [12:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1394.eqiad.wmnet with reason: host reimage [12:59:48] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1408.eqiad.wmnet with reason: host reimage [12:59:51] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-web1001.eqiad.wmnet with OS bullseye [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1300) [13:00:46] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [13:00:47] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 01s) [13:00:53] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [13:00:54] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) 
(duration: 00m 01s) [13:01:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1396.eqiad.wmnet with reason: host reimage [13:02:25] !log build2001 - Total reclaimed space: 23.31GB [13:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:38] !log aokoth@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM vrts1002.eqiad.wmnet [13:02:53] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2350.codfw.wmnet with OS bullseye [13:03:23] !log Relaunching build-production-images [13:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1390.eqiad.wmnet with OS bullseye [13:04:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1408.eqiad.wmnet with reason: host reimage [13:05:44] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2352.codfw.wmnet with OS bullseye [13:06:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1388.eqiad.wmnet with OS bullseye [13:07:31] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] prometheus-php-fpm-exporter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987440 (owner: 10Clément Goubert) [13:07:42] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2354.codfw.wmnet with OS bullseye [13:07:54] (03PS3) 10Volans: P:httpbb: migrate tests from cumin1001 to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [13:07:59] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [13:08:12] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the quick review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:08:21] (03PS2) 10Filippo Giunchedi: cache: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831) [13:08:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2317.codfw.wmnet with reason: host reimage [13:08:50] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2318.codfw.wmnet with reason: host reimage [13:09:16] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] cache: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997807 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:09:44] (03PS1) 10Btullis: Fix the reuse-analytics-raid1-2dev recipe [puppet] - 10https://gerrit.wikimedia.org/r/997812 (https://phabricator.wikimedia.org/T349398) [13:10:11] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-web1001.eqiad.wmnet with OS bullseye [13:10:36] !log kamila@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw1388.eqiad.wmnet [13:10:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1388.eqiad.wmnet [13:11:14] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2319.codfw.wmnet with reason: host reimage [13:11:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2318.codfw.wmnet with reason: host reimage [13:11:39] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2356.codfw.wmnet with OS bullseye [13:11:46] (03PS2) 10Filippo Giunchedi: graphite: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831) [13:11:48] (03PS1) 10Filippo Giunchedi: cassandra: remove nrpe::monitor_systemd_unit_state 
[puppet] - 10https://gerrit.wikimedia.org/r/997814 (https://phabricator.wikimedia.org/T337831) [13:11:50] (03CR) 10Btullis: [C: 03+2] Fix the reuse-analytics-raid1-2dev recipe [puppet] - 10https://gerrit.wikimedia.org/r/997812 (https://phabricator.wikimedia.org/T349398) (owner: 10Btullis) [13:11:56] (03PS1) 10Filippo Giunchedi: confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) [13:12:00] (03PS1) 10Filippo Giunchedi: chartmuseum: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831) [13:12:04] (03PS1) 10Filippo Giunchedi: docker_registry: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831) [13:12:08] (03PS1) 10Filippo Giunchedi: envoy: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997818 (https://phabricator.wikimedia.org/T337831) [13:12:12] (03PS1) 10Filippo Giunchedi: mediawiki: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) [13:12:16] (03PS1) 10Filippo Giunchedi: etcd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) [13:12:20] (03PS1) 10Filippo Giunchedi: mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) [13:13:30] (03CR) 10Btullis: [C: 03+1] Allow pods in the dse k8s cluster to reach public-druid [puppet] - 10https://gerrit.wikimedia.org/r/997794 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [13:13:48] (03CR) 10Btullis: [C: 03+1] Allow pods in the dse k8s cluster to reach an-druid [puppet] - 10https://gerrit.wikimedia.org/r/997793 (https://phabricator.wikimedia.org/T356623) (owner: 
10Brouberol) [13:13:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2319.codfw.wmnet with reason: host reimage [13:14:07] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [13:14:12] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s) [13:14:21] (03CR) 10Brouberol: [C: 03+2] superset: configure extra TLS SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/997799 (https://phabricator.wikimedia.org/T356482) (owner: 10Brouberol) [13:15:01] (03CR) 10Btullis: [C: 03+1] "nit: In fact this role is applied to 3 hosts, an-coord100[1,3,4]." [puppet] - 10https://gerrit.wikimedia.org/r/997792 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [13:15:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1392.eqiad.wmnet with OS bullseye [13:15:37] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-web1001.eqiad.wmnet with OS bullseye [13:16:37] !log pruning unneeded openjdk-17-jre-headless packages on restbase* hosts [13:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1394.eqiad.wmnet with OS bullseye [13:18:24] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2194.codfw.wmnet with OS bookworm [13:19:26] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2350.codfw.wmnet with reason: host reimage [13:20:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1396.eqiad.wmnet with OS bullseye [13:20:26] 
RECOVERY - Disk space on build2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [13:20:56] (03CR) 10Brouberol: [C: 03+2] Allow pods in the dse k8s cluster to reach an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/997792 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [13:20:57] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [13:21:03] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 05s) [13:21:56] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) [13:22:09] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): (no justification provided) (duration: 00m 12s) [13:22:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2350.codfw.wmnet with reason: host reimage [13:22:27] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2352.codfw.wmnet with reason: host reimage [13:22:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1408.eqiad.wmnet with OS bullseye [13:24:07] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2354.codfw.wmnet with reason: host reimage [13:25:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2352.codfw.wmnet with reason: host reimage [13:27:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2354.codfw.wmnet with reason: host reimage [13:28:00] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2356.codfw.wmnet with reason: host reimage [13:28:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host mw2317.codfw.wmnet with OS bullseye [13:29:00] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (codfw): (no justification provided) [13:29:18] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (codfw): (no justification provided) (duration: 00m 17s) [13:29:41] (03PS1) 10Brouberol: Enable dse k8s workers -> presto https traffic [puppet] - 10https://gerrit.wikimedia.org/r/997846 [13:30:12] (03CR) 10Brouberol: [C: 03+2] Allow pods in the dse k8s cluster to reach an-druid [puppet] - 10https://gerrit.wikimedia.org/r/997793 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [13:30:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2318.codfw.wmnet with OS bullseye [13:30:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2356.codfw.wmnet with reason: host reimage [13:32:49] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [13:32:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2319.codfw.wmnet with OS bullseye [13:33:37] (03CR) 10Volans: [C: 03+1] "Change LGTM, but I think will be cleaner with a minor hiera change, see inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [13:33:44] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-web1001.eqiad.wmnet with reason: host reimage [13:34:13] !log installing openjdk-17 security updates [13:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:18] (03CR) 10Brouberol: [C: 03+2] Allow pods in the dse k8s cluster to reach public-druid [puppet] - 10https://gerrit.wikimedia.org/r/997794 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [13:36:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P56337 and previous config saved to /var/cache/conftool/dbconfig/20240206-133619-root.json [13:36:30] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-web1001.eqiad.wmnet with reason: host reimage [13:37:30] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[13:37:31] (03PS2) 10Brouberol: Enable dse k8s workers -> presto https traffic [puppet] - 10https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) [13:37:38] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (codfw): Ensure that all codfw nodes are running the same revision [13:37:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [13:38:11] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (codfw): Ensure that all codfw nodes are running the same revision (duration: 00m 32s) [13:38:46] !log jgiannelos@deploy2002 Started deploy [kartotherian/deploy@3325683] (eqiad): Ensure that all eqiad nodes are running the same revision [13:39:17] !log jgiannelos@deploy2002 Finished deploy [kartotherian/deploy@3325683] (eqiad): Ensure that all eqiad nodes are running the same revision (duration: 00m 31s) [13:39:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:39:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:40:31] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Idle - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:40:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [13:41:38] k8s@codfw BGP alert is expected? 
[13:42:55] RECOVERY - kartotherian endpoints health on maps1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:45:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2352.codfw.wmnet with OS bullseye [13:45:43] (03PS13) 10Brouberol: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis) [13:47:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2354.codfw.wmnet with OS bullseye [13:48:53] (03CR) 10CDanis: [C: 03+1] jaeger: route trace.w.o to jaeger-query [deployment-charts] - 10https://gerrit.wikimedia.org/r/997789 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [13:50:04] !log jmm@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=config-master,name=codfw [13:50:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2356.codfw.wmnet with OS bullseye [13:51:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P56338 and previous config saved to /var/cache/conftool/dbconfig/20240206-135124-root.json [13:51:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host config-master2001.codfw.wmnet [13:55:27] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host config-master2001.codfw.wmnet [13:56:28] !log jmm@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=config-master,name=codfw [13:56:50] !log jmm@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=config-master,name=eqiad [13:57:03] !log 
jmm@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=config-master,name=eqiad [13:57:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host config-master1001.eqiad.wmnet [13:59:44] (03PS3) 10Brouberol: Enable dse k8s workers -> presto https traffic [puppet] - 10https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1400). [14:00:05] Kizule and lucaswerkmeister: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:21] o/ [14:00:41] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-web1001.eqiad.wmnet with OS bullseye [14:01:05] (03CR) 10CI reject: [V: 04-1] Enable dse k8s workers -> presto https traffic [puppet] - 10https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [14:01:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2194.codfw.wmnet with OS bookworm [14:01:47] I can deploy [14:01:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host config-master1001.eqiad.wmnet [14:02:10] !log jmm@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=config-master,name=eqiad [14:02:51] (03PS4) 10Brouberol: Enable dse k8s workers -> presto https traffic [puppet] - 10https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) [14:02:57] oh, Kizule removed the maintenance script request apparently [14:03:26] ah, https://phabricator.wikimedia.org/T350431#9517284 [14:03:27] :/ [14:04:11] (03CR) 10TrainBranchBot: 
[C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: 10Lucas Werkmeister) [14:04:43] I can reproduce T356505 at https://commons.wikimedia.org/w/index.php?title=File:CSD_Berlin_2019_-_Lucas_Werkmeister_-_24_-_Bi,_Pan,_Ace_Flags.jpg&action=submit [14:04:44] T356505: File page edit preview does not load Filepage.css - https://phabricator.wikimedia.org/T356505 [14:04:47] so I should be able to test the fix there [14:05:10] (03CR) 10Btullis: [C: 03+1] Enable dse k8s workers -> presto https traffic [puppet] - 10https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [14:05:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::webserver [14:05:13] (03CR) 10Stevemunene: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [14:05:35] (03CR) 10Brouberol: [C: 03+2] Enable dse k8s workers -> presto https traffic [puppet] - 10https://gerrit.wikimedia.org/r/997846 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [14:05:45] (03CR) 10Marostegui: "I will take care of merging this myself - as I want to issue a puppet run on the masters right after merging to make sure nothing gets wei" [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [14:06:41] Lucas_WMDE: is it required run namespacedupes.php if all were accessible after deployment for T355662 and need to run namespacedupes.php for T349581 should I add it to calendar [14:06:41] T355662: Create portal namespace on kannada wikipedia - https://phabricator.wikimedia.org/T355662 [14:06:41] T349581: Create draft namespace and add namespaces aliases for hewikinews - https://phabricator.wikimedia.org/T349581 [14:07:15] (03PS1) 10Muehlenhoff: 
Switch an-web to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/997850 (https://phabricator.wikimedia.org/T349619) [14:07:16] * Lucas_WMDE looks [14:07:45] I probably should’ve run it there and forgot, yeah [14:07:47] let me check now [14:08:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10Jhancock.wm) This rack is physically ready for tomorrow. [14:08:22] oh yeah knwiki has plenty of links to fix apparently [14:08:35] (no pages to fix, but would still be nice to fix the links) [14:08:52] likewise hewikinews (though with fewer links to fix) [14:09:27] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes knwiki --fix # T355662 (crashed) [14:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/997468 (https://phabricator.wikimedia.org/T355172) (owner: 10Slyngshede) [14:09:45] (03CR) 10Muehlenhoff: [C: 03+2] Switch an-web to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/997850 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:10:48] anzx: looks like the maintenance script needs to be fixed first [14:11:58] (03CR) 10Slyngshede: profile: remove Icinga-based systemd unit failed check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [14:12:09] (03CR) 10Slyngshede: [C: 04-1] profile: remove Icinga-based systemd unit failed check [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [14:12:28] ugh, and one of the gate-and-submits for the backport failed with ECONNRESET in npm [14:12:49] Lucas_WMDE: will ask again later in few days, thanks [14:13:04] sounds good, thanks! 
[14:13:05] (03CR) 10CI reject: [V: 04-1] Load Filepage.css when previewing File pages [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: 10Lucas Werkmeister) [14:13:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: 10Lucas Werkmeister) [14:13:19] let’s try that again… [14:14:25] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T356726 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [14:14:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::webserver [14:16:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-web1001.eqiad.wmnet [14:16:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: 10Lucas Werkmeister) [14:16:53] (not sure why `scap backport` exited early there for some reason while the build was still ongoing… I started it again now) [14:17:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:19:45] (03PS1) 10Slyngshede: Allow users to view the entire SSH key [software/bitu] - 10https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140) [14:20:46] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Password reset: Allow signed in users to navigate. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/997506 (https://phabricator.wikimedia.org/T355907) (owner: 10Slyngshede) [14:21:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P56340 and previous config saved to /var/cache/conftool/dbconfig/20240206-142134-root.json [14:22:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-web1001.eqiad.wmnet [14:22:35] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] CI Fix broken tests. [software/bitu] - 10https://gerrit.wikimedia.org/r/997468 (https://phabricator.wikimedia.org/T355172) (owner: 10Slyngshede) [14:26:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10Jhancock.wm) @BTullis I can reseat the backplane to try and fix this. Is it safe for me to do so? or are you currently working... [14:28:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) >>! In T355830#9517383, @Jhancock.wm wrote: > @BTullis I can reseat the backplane to try and fix this. Is it safe for... 
[14:32:31] !log debug convert-disks cookbook against out-of-use ms-be2044 T308677 [14:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:35] T308677: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 [14:32:44] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2044 [14:33:00] (03Merged) 10jenkins-bot: Load Filepage.css when previewing File pages [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997779 (https://phabricator.wikimedia.org/T356505) (owner: 10Lucas Werkmeister) [14:33:13] finally [14:33:23] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:997779|Load Filepage.css when previewing File pages (T356505)]] [14:33:32] T356505: File page edit preview does not load Filepage.css - https://phabricator.wikimedia.org/T356505 [14:34:40] (03CR) 10Filippo Giunchedi: "SGTM, thank you Manuel for the quick review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [14:35:10] (03CR) 10Muehlenhoff: Allow users to view the entire SSH key (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140) (owner: 10Slyngshede) [14:35:27] 7m, apparently 14 k8s nodes are taking longer to docker pull the new image [14:35:29] *hm [14:36:16] (03PS1) 10Brouberol: service: register superset and superset-next under ingress [puppet] - 10https://gerrit.wikimedia.org/r/997857 (https://phabricator.wikimedia.org/T356483) [14:36:23] ok, that finished now [14:36:28] (03PS2) 10Brouberol: Add superset/superset-next.svc.eqiad.wmnet records [dns] - 10https://gerrit.wikimedia.org/r/995174 (https://phabricator.wikimedia.org/T356481) [14:36:36] (03PS1) 10Brouberol: superset: setup dyna mapping rules [puppet] - 10https://gerrit.wikimedia.org/r/997858 (https://phabricator.wikimedia.org/T356481) [14:36:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P56341 and previous config saved to /var/cache/conftool/dbconfig/20240206-143639-root.json [14:36:40] (03PS1) 10Brouberol: Superset: setup temporary external domains for the k8s deployments [dns] - 10https://gerrit.wikimedia.org/r/997859 (https://phabricator.wikimedia.org/T356482) [14:36:52] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and lucaswerkmeister: Backport for [[gerrit:997779|Load Filepage.css when previewing File pages (T356505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:37:20] seems to work fine \o/ [14:37:23] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and lucaswerkmeister: Continuing with sync [14:38:33] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1009.wikimedia.org [14:39:11] (03PS2) 10Brouberol: service: register superset and 
superset-next under ingress [puppet] - 10https://gerrit.wikimedia.org/r/997857 (https://phabricator.wikimedia.org/T356483) [14:39:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10Jhancock.wm) @BTullis looks like it worked. But since that backplane error occurred twice already, if it happens again lmk and... [14:44:14] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:997779|Load Filepage.css when previewing File pages (T356505)]] (duration: 10m 51s) [14:44:18] T356505: File page edit preview does not load Filepage.css - https://phabricator.wikimedia.org/T356505 [14:44:25] (SystemdUnitFailed) firing: (20) prometheus-phpfpm-statustext-textfile.service Failed on mw1371:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:44:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) >>! In T355830#9517443, @Jhancock.wm wrote: > @BTullis looks like it worked. But since that backplane error occurred... 
[14:44:58] now it’s also working without mwdebug, as far as I can tell \o/ [14:45:18] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:45:25] ok, then I think we’re done! [14:45:31] !log UTC afternoon backport+config window done [14:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:41] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [14:48:06] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1009.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [14:49:18] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1009.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [14:49:18] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1009.wikimedia.org [14:49:26] (SystemdUnitFailed) resolved: (40) prometheus-phpfpm-statustext-textfile.service Failed on mw1357:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:50:33] (PS1) Muehlenhoff: New cookbook to reboot/restart config-master hosts [cookbooks] - https://gerrit.wikimedia.org/r/997887 [14:50:48] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[14:51:13] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2044 [14:51:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P56343 and previous config saved to /var/cache/conftool/dbconfig/20240206-145144-root.json [14:51:51] (03PS1) 10Muehlenhoff: Extend config-master Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/997888 [14:52:28] (03CR) 10Eevans: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/997814 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [14:52:33] (03PS1) 10Hashar: wm-checks-api: handle Zuul 'Merge failed' messages [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/997889 (https://phabricator.wikimedia.org/T356647) [14:52:58] (03CR) 10Hashar: [C: 03+2] wm-checks-api: handle Zuul 'Merge failed' messages [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/997889 (https://phabricator.wikimedia.org/T356647) (owner: 10Hashar) [14:53:31] (03PS2) 10Filippo Giunchedi: profile: remove absented statsd hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862) [14:53:35] (03PS2) 10Filippo Giunchedi: profile: remove Icinga-based systemd unit failed check [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) [14:53:41] (03Merged) 10jenkins-bot: wm-checks-api: handle Zuul 'Merge failed' messages [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/997889 (https://phabricator.wikimedia.org/T356647) (owner: 10Hashar) [14:54:24] !log hashar@deploy2002 Started deploy [gerrit/gerrit@2e441ac]: wm-checks-api: handle Zuul 'Merge failed' messages - T356647 [14:54:30] T356647: wmf-checks-api: Gerrit checks display lists "merge failed" as success - https://phabricator.wikimedia.org/T356647 [14:54:31] !log hashar@deploy2002 
Finished deploy [gerrit/gerrit@2e441ac]: wm-checks-api: handle Zuul 'Merge failed' messages - T356647 (duration: 00m 07s) [14:54:32] (03CR) 10Filippo Giunchedi: profile: remove Icinga-based systemd unit failed check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [14:54:40] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye [14:56:37] !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [14:59:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:52] (03PS2) 10Filippo Giunchedi: cassandra: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997814 (https://phabricator.wikimedia.org/T337831) [15:02:53] (03CR) 10Filippo Giunchedi: [C: 03+2] cassandra: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997814 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [15:03:11] (03CR) 10MVernon: "Hi!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [15:04:19] (03CR) 10Filippo Giunchedi: [C: 03+2] jaeger: route trace.w.o to jaeger-query [deployment-charts] - 10https://gerrit.wikimedia.org/r/997789 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [15:06:29] (03CR) 10Jforrester: [C: 03+1] libraryupgrader: use system docker on newer Debian versions [puppet] - 10https://gerrit.wikimedia.org/r/997548 (owner: 10Majavah) [15:06:37] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on netbox1002.eqiad.wmnet with reason: Restoring DB from backup on netboxdb1002 [15:06:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P56344 and previous config saved to /var/cache/conftool/dbconfig/20240206-150649-root.json [15:06:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on netbox1002.eqiad.wmnet with reason: Restoring DB from backup on netboxdb1002 [15:07:50] !log Disabling netbox service on netbox1002 prior to db restore from backup [15:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:41] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [15:14:10] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [15:14:24] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [15:15:41] PROBLEM - netbox Postgres on netboxdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB netbox (host:localhost) 22039184 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:16:48] (03PS1) 10Clément Goubert: prometheus-apache-exporter: Bump version to 0.0.4 [puppet] - 
10https://gerrit.wikimedia.org/r/997894 (https://phabricator.wikimedia.org/T283861) [15:16:55] RECOVERY - netbox Postgres on netboxdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB netbox (host:localhost) 0 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:17:26] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:02] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1284/co" [puppet] - 10https://gerrit.wikimedia.org/r/997894 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [15:23:25] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2094.codfw.wmnet with reason: host reimage [15:25:50] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860 [15:25:50] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860 [15:25:54] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [15:26:22] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2094.codfw.wmnet with reason: host reimage [15:26:33] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=wdqs2016.codfw.wmnet [15:27:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] P:docker::builder clean docker image cache regularly. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997796 (owner: 10Slyngshede) [15:27:39] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860 [15:27:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860 [15:28:42] !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [15:29:16] (03PS1) 10Clément Goubert: prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861) [15:34:17] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning all hosts in row B4 for switch maintenance - bking@cumin2002 - T355860 [15:34:17] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning all hosts in row B4 for switch maintenance - bking@cumin2002 - T355860 [15:34:21] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [15:37:19] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860 [15:37:19] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning all hosts in row B for switch maintenance - bking@cumin2002 - T355860 [15:37:26] (SystemdUnitFailed) firing: (11) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:41:51] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: eelastic2058,elastic2070,elastic2095,elastic2096 for switch 
maintenance - bking@cumin2002 - T355860 [15:41:52] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: eelastic2058,elastic2070,elastic2095,elastic2096 for switch maintenance - bking@cumin2002 - T355860 [15:41:55] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [15:41:57] !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/datahub: sync on main [15:42:26] (SystemdUnitFailed) firing: (11) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:36] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2058,elastic2070,elastic2095,elastic2096 for switch maintenance - bking@cumin2002 - T355860 [15:42:36] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2058,elastic2070,elastic2095,elastic2096 for switch maintenance - bking@cumin2002 - T355860 [15:43:18] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2058 for switch maintenance - bking@cumin2002 - T355860 [15:43:18] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2058 for switch maintenance - bking@cumin2002 - T355860 [15:43:28] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@05fa5c9]: Disabling storage for ptwiki [15:43:45] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2058* for switch maintenance - bking@cumin2002 - T355860 [15:43:47] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2058* for switch maintenance - bking@cumin2002 - T355860 [15:44:00] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host elastic2094.codfw.wmnet with OS bullseye [15:44:09] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2058*,elastic2070*,elastic2095*,elastic2096* for switch maintenance - bking@cumin2002 - T355860 [15:44:09] SRE, ops-codfw, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2094.codfw.wmnet with OS... [15:44:12] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2058*,elastic2070*,elastic2095*,elastic2096* for switch maintenance - bking@cumin2002 - T355860 [15:44:29] (ProbeDown) firing: (2) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:46:38] (CR) Clément Goubert: [C: +1] mediawiki: remove nrpe::monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi) [15:46:40] SRE, ops-codfw, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (BTullis) Open→Resolved a:BTullis Thanks @Jhancock.wm - The reimage cookbook hung once at PXE boot, but I gave it a... 
[15:46:54] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [15:47:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:47:26] (SystemdUnitFailed) firing: (8) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:47:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:48:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:40] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1009.wikimedia.org [15:49:24] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2033-2034].codfw.wmnet with reason: T355860 [15:50:55] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [15:51:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2033-2034].codfw.wmnet with reason: T355860 [15:51:17] SRE-Sprint-Week-Sustainability-March2023, Beta-Cluster-Infrastructure, DBA, MediaWiki-libs-Rdbms, Epic: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (TK-999) [15:51:56] !log 
bking@cumin2002 START - Cookbook sre.dns.netbox [15:52:34] !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [15:53:20] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:21] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudelastic1009.wikimedia.org [15:54:29] (ProbeDown) firing: (2) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:56:33] !log moving Netbox server uplinks from asw-b4-codfw to lsw1-b4-codfw to prep config for server moves T355860 [15:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:40] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [15:56:47] (03CR) 10Clément Goubert: "Removing vote until I understand our systemd monitoring a little better" [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [15:57:54] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:23] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-b-codfw,lsw1-b4-codfw.mgmt with reason: prepping for server uplink migration [15:58:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-b-codfw,lsw1-b4-codfw.mgmt with reason: prepping for server uplink migration [15:58:47] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from 
asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1cb41722-6e24-4871-a903-cdb117a03449) set by cmooney... [15:58:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:58:55] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cr[1-2]-codfw with reason: prepping for server uplink migration [15:58:57] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10isarantopoulos) I tried to delete a revision and an inferenceservice on experimental namespace and it seems that I don't have access: ` kubectl delete re... [15:59:10] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr[1-2]-codfw with reason: prepping for server uplink migration [15:59:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:59:17] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b2349fc0-73a1-418a-b3b8-284c8a40d573) set by cmooney... [15:59:29] (ProbeDown) firing: (2) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:05] eoghan, jelto, and arnoldokoth: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1600). 
[16:00:09] !log configuring lsw1-b4-codfw with port config for new hosts T355860 [16:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:08] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@05fa5c9]: Disabling storage for ptwiki (duration: 17m 39s) [16:01:25] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:01:26] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:02:12] PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:02:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:02:41] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw [16:02:57] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:03:04] RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:03:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw [16:03:13] uh [16:03:18] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:03:20] 10SRE, 10SRE-swift-storage, 
10ops-codfw, 10Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a3c16d29-3284-4390-9f38-033ef67e36ff) set by cmooney... [16:03:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:04:29] (ProbeDown) firing: (6) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:04:31] (03CR) 10Brouberol: Add a deployment chart for Superset (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [16:04:59] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1009 to private IPs - bking@cumin2002" [16:05:15] !log Commencing server uplink moves from old switch to new in codfw rack B4 T355860 [16:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:19] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [16:05:25] (03PS1) 10Ilias Sarantopoulos: ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/997901 [16:05:45] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1009 to private IPs - bking@cumin2002" [16:05:46] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:07:22] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1009 [16:07:57] (ProbeDown) resolved: Service thanos-query:443 has failed probes 
(http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:18] (CR) Brouberol: Add a deployment chart for Superset (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: Btullis) [16:08:49] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1009 [16:10:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [16:10:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [16:10:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T355609)', diff saved to https://phabricator.wikimedia.org/P56347 and previous config saved to /var/cache/conftool/dbconfig/20240206-161043-marostegui.json [16:10:47] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:10:49] !log Hosts migrated and basic connectivity ok codfw rack B4 T355860 [16:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:54] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [16:12:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:12:58] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10cmooney) All hosts moved successfully, all now responding to pings fine and MAC forwarding tables look correct. [16:13:39] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:15:02] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:17:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:18:17] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [16:18:19] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [16:18:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T355609)', diff saved to https://phabricator.wikimedia.org/P56348 and previous config saved to /var/cache/conftool/dbconfig/20240206-161849-marostegui.json [16:18:53] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:20:48] (03CR) 10AikoChou: [C: 03+1] ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/997901 (owner: 10Ilias Sarantopoulos) [16:21:38] (03CR) 10Klausman: [C: 03+1] ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/997901 (owner: 10Ilias Sarantopoulos) [16:23:00] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:31] (03PS1) 10Brouberol: ferm: fix typo in the public druid ferm_srange rule [puppet] - 10https://gerrit.wikimedia.org/r/997906 [16:24:37] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/997901 (owner: 10Ilias Sarantopoulos) [16:25:30] (03Merged) 10jenkins-bot: ml-services: remove gpu from article-descriptions in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/997901 (owner: 10Ilias Sarantopoulos) [16:25:56] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:25:59] (03CR) 10Stevemunene: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/997906 (owner: 10Brouberol) [16:26:26] (03CR) 10FNegri: [C: 03+1] Provide a standalone bookworm-web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) (owner: 10Majavah) [16:26:30] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[16:26:40] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2087*,elastic2037*,elastic2038*,elastic2055*,elastic2088*,elastic2073*,elastic2074* for switch maintenance - bking@cumin2002 - T355860 [16:26:43] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [16:26:43] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2087*,elastic2037*,elastic2038*,elastic2055*,elastic2088*,elastic2073*,elastic2074* for switch maintenance - bking@cumin2002 - T355860 [16:26:46] (03CR) 10Brouberol: [C: 03+2] ferm: fix typo in the public druid ferm_srange rule [puppet] - 10https://gerrit.wikimedia.org/r/997906 (owner: 10Brouberol) [16:26:59] !log T353459 Running mwscript CampaignEvents:GenerateInvitationList --wiki=metawiki --listfile=/home/daimona/list.txt [16:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:03] T353459: Develop a prototype for Event Invitations with scoring on likelihood of valuable participation - https://phabricator.wikimedia.org/T353459 [16:27:10] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:52] erm looks like we got a problem with mw-on-k8s [16:29:10] Trying to find out what [16:29:28] I can curl but I get random timeouts with httpbb [16:29:30] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp[2033-2034].codfw.wmnet [16:29:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp[2033-2034].codfw.wmnet [16:29:35] Probably some pods in a bad state [16:29:41] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [16:29:50] !log bking@cumin2002 conftool action : 
set/pooled=yes; selector: name=wdqs2016.codfw.wmnet [16:30:19] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=(cdn|ats-be) [16:30:20] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=(cdn|ats-be) [16:30:56] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:33:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56349 and previous config saved to /var/cache/conftool/dbconfig/20240206-163355-marostegui.json [16:34:09] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cloudelastic1009.mgmt.eqiad.wmnet on all recursors [16:34:10] (03CR) 10JMeybohm: [C: 04-2] "Think we need to hold back because of https://phabricator.wikimedia.org/T356787 - these hosts are still buster." [puppet] - 10https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [16:34:13] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudelastic1009.mgmt.eqiad.wmnet on all recursors [16:34:27] (03CR) 10JMeybohm: [C: 04-2] "Think we need to hold back because of https://phabricator.wikimedia.org/T356787 - these hosts are still buster." 
[puppet] - 10https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [16:34:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:35:16] !log Roll-restarting mw-api-ext deployment in codfw [16:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:38] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye [16:35:42] (03CR) 10JMeybohm: [C: 03+1] "I think this is fine as the resource was absent anyways" [puppet] - 10https://gerrit.wikimedia.org/r/997818 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [16:35:59] 10SRE, 10observability, 10Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (10MatthewVernon) [16:36:44] <_joe_> jouncebot: nowandnext [16:36:44] For the next 0 hour(s) and 23 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1600) [16:36:44] In 0 hour(s) and 23 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1700) [16:38:07] (03CR) 10Scott French: "Thanks again for the review. Also that's great - I'd not seen `Hosts: auto` before." 
[puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [16:38:37] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [16:38:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:38:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:39:12] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:39:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:40:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:40:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:40:56] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:24] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:36] (03CR) 10MVernon: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: 10Eevans) [16:42:38] (03CR) 10JMeybohm: [C: 04-1] etcd: remove nrpe::monitor_systemd_unit_state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [16:43:09] 10SRE, 10observability, 10Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (10jcrespo) Potentially a similar issue (request/traffic related high load) happened around the past 28 of September, when I added this TOD... [16:45:06] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:47:06] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:26] (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56352 and previous config saved to /var/cache/conftool/dbconfig/20240206-164902-marostegui.json [16:49:14] (03CR) 10MVernon: "I think the change to spare has to be wrong (as the role doesn't exist)." 
[puppet] - 10https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [16:50:05] (03CR) 10MVernon: "Sorry, having seen the other change and PCC error, I think that problem must fit here as well - there isn't a spare::system role that I ca" [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: 10Eevans) [16:51:19] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage [16:53:19] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/997888 (owner: 10Muehlenhoff) [16:53:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:53:43] (03PS21) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [16:54:09] 10SRE-OnFire, 10Incident Tooling: Corto: internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (10lmata) [16:54:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage [16:54:37] 10SRE-OnFire, 10Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (10lmata) [16:54:44] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [16:54:48] 10SRE-OnFire, 10Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (10jhathaway) [16:55:17] 10SRE-OnFire, 10Incident Tooling: 
introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (10jhathaway) [16:55:53] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:56:22] 10SRE-OnFire, 10Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (10jhathaway) [16:56:55] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10MatthewVernon) swift backends look happy, thanks :) [16:57:33] (03PS3) 10Eevans: sessionstore: remove EOL hosts [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) [16:58:15] 10SRE-OnFire, 10Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790 (10fgiunchedi) [16:58:23] (03PS3) 10Eevans: Remove EOL restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) [16:59:53] (03CR) 10Brouberol: [C: 03+1] Bring two new stat servers into service [puppet] - 10https://gerrit.wikimedia.org/r/997797 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [17:00:05] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1700). Please do the needful. [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:01:07] (03PS4) 10JHathaway: rsyslog: have rsyslog create its own files [puppet] - 10https://gerrit.wikimedia.org/r/997555 [17:03:25] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:03:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:04:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T355609)', diff saved to https://phabricator.wikimedia.org/P56353 and previous config saved to /var/cache/conftool/dbconfig/20240206-170408-marostegui.json [17:04:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [17:04:19] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:04:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [17:04:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T355609)', diff saved to https://phabricator.wikimedia.org/P56354 and previous config saved to /var/cache/conftool/dbconfig/20240206-170431-marostegui.json [17:05:23] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:08] (03CR) 10MVernon: [C: 03+1] sessionstore: remove EOL hosts [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: 10Eevans) [17:06:45] (03CR) 10MVernon: 
[C: 03+1] Remove EOL restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [17:06:58] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [17:08:25] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:08:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.codfw.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:10:33] (03CR) 10BryanDavis: Provide context for account creation. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: 10Slyngshede) [17:11:22] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [17:12:37] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T355609)', diff saved to https://phabricator.wikimedia.org/P56355 and previous config saved to /var/cache/conftool/dbconfig/20240206-171240-marostegui.json [17:12:45] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:13:41] (03PS4) 10Eevans: sessionstore: remove EOL hosts [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) [17:13:47] RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:13:51] (SwaggerProbeHasFailures) resolved: (2) Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:15:39] (03PS1) 10Andrew Bogott: rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - 10https://gerrit.wikimedia.org/r/997921 [17:16:27] (03PS2) 10Andrew Bogott: rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - 10https://gerrit.wikimedia.org/r/997921 [17:17:25] (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:17:37] (03CR) 10David Caro: [C: 03+1] rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - 10https://gerrit.wikimedia.org/r/997921 (owner: 10Andrew Bogott) [17:18:06] (03CR) 10Volans: [C: 03+1] "LGTM, open question for the path" [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 (owner: 10Muehlenhoff) [17:18:40] (03PS3) 10Andrew Bogott: rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - 10https://gerrit.wikimedia.org/r/997921 (https://phabricator.wikimedia.org/T356621) [17:19:13] (03CR) 10Eevans: [C: 03+2] sessionstore: remove EOL hosts [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: 10Eevans) [17:22:46] !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts sessionstore[1001-1003].eqiad.wmnet [17:24:21] (03CR) 10Andrew Bogott: [C: 03+2] rabbitmq: increase heartbeat timeout and number of heartbeats [puppet] - 10https://gerrit.wikimedia.org/r/997921 (https://phabricator.wikimedia.org/T356621) (owner: 10Andrew Bogott) [17:25:38] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: dc=codfw,service=kubesvc,name=mw.* [17:26:31] <_joe_> ok, now pooling them [17:26:35] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,service=kubesvc,name=mw.* [17:26:52] doing the same on eqiad [17:27:31] !log cgoubert@cumin2002 conftool action : set/weight=10; selector: name=mw.*,dc=eqiad,cluster=kubernetes,service=kubesvc [17:27:39] pooling now [17:27:44] 10SRE, 10ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (10Eevans) 05Open→03Resolved The RAID has been rebuilt, let's hope 3rd time is the charm! 
[17:27:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P56356 and previous config saved to /var/cache/conftool/dbconfig/20240206-172747-marostegui.json [17:27:55] !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=mw.*,dc=eqiad,cluster=kubernetes,service=kubesvc [17:30:29] (03PS1) 10Eevans: site.pp: remove EOL sessionstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/997928 (https://phabricator.wikimedia.org/T353405) [17:31:01] (03PS1) 10Giuseppe Lavagetto: Do not add env variables when they're empty [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997873 (https://phabricator.wikimedia.org/T356780) [17:33:30] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [17:35:27] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sessionstore[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [17:35:59] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (10akosiaris) Let's see how I can be of help. > what branch is used to build the package It's configurable in gbp, but the default workflow assumes that the code from... 
[17:36:40] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sessionstore[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [17:36:40] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:40] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sessionstore[1001-1003].eqiad.wmnet [17:37:46] !log rebooting kubernetes2010.codfw.wmnet [17:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:32] (03CR) 10Eevans: [C: 03+2] site.pp: remove EOL sessionstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/997928 (https://phabricator.wikimedia.org/T353405) (owner: 10Eevans) [17:42:13] 10ops-eqiad, 10Cassandra, 10decommission-hardware: Decommission sessionstore100[1-3] - https://phabricator.wikimedia.org/T356719 (10Eevans) a:05Eevans→03None [17:42:40] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2073-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:42:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P56357 and previous config saved to /var/cache/conftool/dbconfig/20240206-174253-marostegui.json [17:43:14] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reboot-single for host kubernetes2010.codfw.wmnet [17:43:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:44:10] 
RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:52] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:59] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bullseye [17:48:25] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:51:49] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2010.codfw.wmnet [17:52:52] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:52:58] (03PS12) 10Bking: sre.hosts.reimage: Suggest install-console for troubleshooting [cookbooks] - 10https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) [17:53:25] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:40] !log wikikube: cordon nodes added earlier today in codfw [17:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:55] (03PS1) 10Bking: cloudelastic: Complete cloudelastic1009's 
migration [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) [17:57:12] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [17:57:52] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:58:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T355609)', diff saved to https://phabricator.wikimedia.org/P56358 and previous config saved to /var/cache/conftool/dbconfig/20240206-175800-marostegui.json [17:58:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [17:58:05] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:58:07] !log uncordoning kubernetes2010 [17:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [17:58:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T355609)', diff saved to https://phabricator.wikimedia.org/P56359 and previous config saved to /var/cache/conftool/dbconfig/20240206-175822-marostegui.json [17:59:06] !log wikikube codfw: drain newly added nodes [17:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:29] (ProbeDown) firing: (2) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:04] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1800) [18:00:48] (03PS2) 10Bking: cloudelastic: Complete cloudelastic1009's migration [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) [18:00:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:01:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:01:42] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:41] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:03:25] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:05:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:05:20] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:05:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - 
https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:06:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:06:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T355609)', diff saved to https://phabricator.wikimedia.org/P56360 and previous config saved to /var/cache/conftool/dbconfig/20240206-180641-marostegui.json [18:06:45] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:09:11] (03PS1) 10Btullis: [DPE Postgres] Only backup the latest postgres dump file [puppet] - 10https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655) [18:09:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997873 (https://phabricator.wikimedia.org/T356780) (owner: 10Giuseppe Lavagetto) [18:12:00] (03CR) 10Bking: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1285/console" [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:12:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (3) Elasticsearch instance elastic2037-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:13:16] !log wikikube codfw: belated homer commit of new nodes [18:13:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:34] :eyes on elastic alert above [18:14:44] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:26] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:17:35] (03PS4) 10Eevans: Remove EOL restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) [18:17:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2037-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:18:53] (03CR) 10Brouberol: [C: 03+1] "Looks good 👍" [puppet] - 10https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis) [18:20:00] (03CR) 10Eevans: [C: 03+2] Remove EOL restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [18:20:53] !log wikikube codfw: uncordon new nodes [18:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P56362 and previous config saved to /var/cache/conftool/dbconfig/20240206-182148-marostegui.json [18:22:28] !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts restbase[2013-2020].codfw.wmnet [18:24:06] (03PS1) 10Eevans: 
site.pp: remove decommissioned restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/997941 (https://phabricator.wikimedia.org/T352469) [18:27:15] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 7 hosts with reason: T355860 [18:27:19] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [18:27:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 7 hosts with reason: T355860 [18:28:06] (03PS3) 10Scott French: systemd::unit: clean up ownership file [puppet] - 10https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) [18:28:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:29:24] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:30:32] (03Merged) 10jenkins-bot: Do not add env variables when they're empty [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997873 (https://phabricator.wikimedia.org/T356780) (owner: 10Giuseppe Lavagetto) [18:30:56] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:997873|Do not add env variables when they're empty (T356780)]] [18:31:01] T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780 [18:31:46] (03CR) 10Bking: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1286/console" [puppet] - 10https://gerrit.wikimedia.org/r/997933 
(https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:32:38] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:997873|Do not add env variables when they're empty (T356780)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:36:00] !log oblivian@deploy2002 oblivian: Continuing with sync [18:36:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P56363 and previous config saved to /var/cache/conftool/dbconfig/20240206-183654-marostegui.json [18:38:25] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:42:12] <_joe_> uhm [18:42:16] <_joe_> still firing [18:42:52] (03CR) 10Scott French: "Thanks for taking a look, Moritz. I was actually going to add you as a reviewer, as I saw you originally reviewed [0]." 
[puppet] - https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) (owner: Scott French) [18:42:53] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:997873|Do not add env variables when they're empty (T356780)]] (duration: 11m 57s) [18:43:09] T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780 [18:43:25] (SystemdUnitFailed) firing: (25) httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:43:40] (SystemdUnitFailed) firing: (25) httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:44:10] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:26] (CR) BCornwall: [V: +1] fifo-log-demux: Decouple service from nginx/ats (2 comments) [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall) [18:45:42] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [18:46:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:47:47] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[2013-2020].codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [18:48:26] (SystemdUnitFailed) resolved: (45)
httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:50] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[2013-2020].codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [18:48:50] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:48:51] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase[2013-2020].codfw.wmnet [18:49:53] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [18:50:22] (03CR) 10Eevans: [C: 03+2] site.pp: remove decommissioned restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/997941 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [18:50:49] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [18:52:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T355609)', diff saved to https://phabricator.wikimedia.org/P56364 and previous config saved to /var/cache/conftool/dbconfig/20240206-185201-marostegui.json [18:52:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [18:52:06] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:52:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [18:52:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T355609)', diff saved to https://phabricator.wikimedia.org/P56365 and previous config saved to /var/cache/conftool/dbconfig/20240206-185223-marostegui.json [18:52:31] 10ops-codfw, 10Cassandra, 10decommission-hardware: decommission restbase20[13-20] - https://phabricator.wikimedia.org/T356695 (10Eevans) [19:00:05] brennen and dancy: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T1900). [19:00:19] o/ [19:00:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T355609)', diff saved to https://phabricator.wikimedia.org/P56366 and previous config saved to /var/cache/conftool/dbconfig/20240206-190037-marostegui.json [19:00:50] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:06:51] !log train 1.42.0-wmf.17: considering unblocked for group0, rolling forward. 
[19:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:33] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997945 (https://phabricator.wikimedia.org/T354435) [19:07:35] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997945 (https://phabricator.wikimedia.org/T354435) (owner: 10TrainBranchBot) [19:08:26] (03CR) 10Bking: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1289/console" [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:10:58] o/ [19:13:07] Many `.17 e/C/i/J/JobTraits:92 Received cirrusSearchElasticaWrite job for an unwritable cluster cloudelastic.` [19:13:13] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997945 (https://phabricator.wikimedia.org/T354435) (owner: 10TrainBranchBot) [19:13:14] brennen: [19:13:26] although they're trailing off. [19:13:53] dancy what was the timeline for those errors? We've been migrating cloudelastic to private IPs [19:13:53] and now gone. :-) [19:14:07] I was looking at last 15 minutes.. and then they went away [19:14:41] inflatador: Thanks for the info! [19:14:54] Crisis averted. :-) [19:15:15] sure, more context in T355617 if interested. 
No impact expected, but that cluster doesn't have a lot of redundancy ;( [19:15:16] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [19:15:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P56367 and previous config saved to /var/cache/conftool/dbconfig/20240206-191544-marostegui.json [19:16:32] PROBLEM - WDQS SPARQL on wdqs1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 400 bytes in 0.716 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:17:17] dancy: thanks for ping. i'd seen that spike and assumed it was likely something transient but didn't dig in. [19:17:46] RECOVERY - WDQS SPARQL on wdqs1021 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:18:33] dancy if you can give a timeline (or a place to look for a timeline) I can re-queue the writes that failed [19:18:55] inflatador: Sure. Stand by. 
[19:20:25] (SystemdUnitFailed) firing: (2) prometheus-phpfpm-statustext-textfile.service Failed on mw1406:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:20:29] also now seeing a bunch of "Received cirrusSearchCheckerJob job for an unwritable cluster default" [19:21:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [19:21:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.eqiad.wmnet with OS bullseye [19:21:17] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.17 refs T354435 [19:21:22] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [19:22:21] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "finish cloudelastic1009 private IP migration - bking@cumin2002 - T355617" [19:22:25] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [19:23:13] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "finish cloudelastic1009 private IP migration - bking@cumin2002 - T355617" [19:23:58] ebernhardson ^^ any opinion on those "Received cirrusSearchCheckerJob job for an unwritable cluster default" errors? 
[19:25:26] (SystemdUnitFailed) resolved: (46) prometheus-phpfpm-statustext-textfile.service Failed on mw1350:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:17] (03CR) 10Bking: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1290/console" [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:30:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P56368 and previous config saved to /var/cache/conftool/dbconfig/20240206-193052-marostegui.json [19:31:34] ~186 of those. [19:35:26] !log joal@deploy2002 Started deploy [analytics/refinery@718fc41]: Regular analytics weekly train [analytics/refinery@718fc417] [19:42:04] (03PS3) 10Bking: cloudelastic: Complete cloudelastic1009's migration [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) [19:42:22] (03CR) 10Muehlenhoff: "If you want to take a server out of production use for some time before the eventual decom, the insetup::foo roles should be used." [puppet] - 10https://gerrit.wikimedia.org/r/997607 (https://phabricator.wikimedia.org/T353405) (owner: 10Eevans) [19:43:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [19:43:11] (03CR) 10Ebernhardson: [C: 03+1] cloudelastic: Complete cloudelastic1009's migration [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:43:39] (03CR) 10Bking: [C: 03+2] cloudelastic: Complete cloudelastic1009's migration [puppet] - 10https://gerrit.wikimedia.org/r/997933 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:45:42] dancy quick update re: cloudelastic errors. Based on convo w e-bernhardson they are nothing to worry about...cloudelastic is supposed to be read-only from the normal jobrunner pipeline ATM [19:45:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T355609)', diff saved to https://phabricator.wikimedia.org/P56370 and previous config saved to /var/cache/conftool/dbconfig/20240206-194558-marostegui.json [19:46:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance [19:46:03] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:46:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance [19:46:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:46:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:46:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T355609)', diff saved to https://phabricator.wikimedia.org/P56371 and previous config saved to /var/cache/conftool/dbconfig/20240206-194639-marostegui.json [19:47:43] !log joal@deploy2002 Finished deploy 
[analytics/refinery@718fc41]: Regular analytics weekly train [analytics/refinery@718fc417] (duration: 12m 17s) [19:49:29] thx inflatador. [19:49:39] !log joal@deploy2002 Started deploy [analytics/refinery@718fc41] (thin): Regular analytics weekly train THIN [analytics/refinery@718fc417] [19:49:45] !log joal@deploy2002 Finished deploy [analytics/refinery@718fc41] (thin): Regular analytics weekly train THIN [analytics/refinery@718fc417] (duration: 00m 06s) [19:49:58] !log joal@deploy2002 Started deploy [analytics/refinery@718fc41] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@718fc417] [19:50:16] (03PS1) 10Muehlenhoff: Revert "admin: remove ssh key of Connie Chen" [puppet] - 10https://gerrit.wikimedia.org/r/997952 (https://phabricator.wikimedia.org/T356645) [19:52:02] (03PS2) 10Jdlrobson: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728) [19:52:10] (03CR) 10Muehlenhoff: [C: 03+2] Revert "admin: remove ssh key of Connie Chen" [puppet] - 10https://gerrit.wikimedia.org/r/997952 (https://phabricator.wikimedia.org/T356645) (owner: 10Muehlenhoff) [19:52:12] (03PS1) 10Jdlrobson: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997874 (https://phabricator.wikimedia.org/T356728) [19:53:32] !log joal@deploy2002 Finished deploy [analytics/refinery@718fc41] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@718fc417] (duration: 03m 33s) [19:55:19] (03PS1) 10Muehlenhoff: Revert: admin: remove conniecc1 from groups, set to absent [puppet] - 10https://gerrit.wikimedia.org/r/997953 (https://phabricator.wikimedia.org/T356645) [19:55:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T355609)', diff saved to https://phabricator.wikimedia.org/P56372 and previous config saved to /var/cache/conftool/dbconfig/20240206-195532-marostegui.json 
[19:55:42] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:56:33] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@93fa570]: (no justification provided) [19:57:02] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@93fa570]: (no justification provided) (duration: 00m 28s) [19:59:09] (CR) Muehlenhoff: [C: +2] Revert: admin: remove conniecc1 from groups, set to absent [puppet] - https://gerrit.wikimedia.org/r/997953 (https://phabricator.wikimedia.org/T356645) (owner: Muehlenhoff) [20:04:42] SRE, SRE-Access-Requests, Patch-For-Review: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (MoritzMuehlenhoff) - The SSH key was reinstated, the changes roll out across the next 30 minutes. - The POSIX groups were readded, the changes roll out... [20:07:46] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [20:07:48] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [20:10:32] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@5f38647]: (no justification provided) [20:10:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56373 and previous config saved to /var/cache/conftool/dbconfig/20240206-201039-marostegui.json [20:10:59] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@5f38647]: (no justification provided) (duration: 00m 27s) [20:21:11] !log joal@deploy2002 Started deploy [airflow-dags/analytics@09b8dc5]: Regular analytics weekly train [airflow-dags/analytics@09b8dc55] [20:21:39] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@09b8dc5]: Regular analytics weekly train [airflow-dags/analytics@09b8dc55] (duration: 00m 28s) [20:22:25] (CR) Andrew Bogott: [C: +2] Remove memcached cruft
from codfw1dev cloudservice nodes [puppet] - 10https://gerrit.wikimedia.org/r/997554 (https://phabricator.wikimedia.org/T350995) (owner: 10Andrew Bogott) [20:25:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56374 and previous config saved to /var/cache/conftool/dbconfig/20240206-202546-marostegui.json [20:27:40] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1009.eqiad.wmnet [20:27:45] (03PS2) 10Andrew Bogott: Removed refs to openstack version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/997538 [20:27:55] !log bking@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=cloudelastic,name=cloudelastic1009.eqiad.wmnet [20:28:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10mpopov) Thanks so much @MoritzMuehlenhoff!! [20:40:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T355609)', diff saved to https://phabricator.wikimedia.org/P56375 and previous config saved to /var/cache/conftool/dbconfig/20240206-204053-marostegui.json [20:40:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance [20:40:57] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:41:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance [20:41:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T355609)', diff saved to https://phabricator.wikimedia.org/P56376 and previous config saved to /var/cache/conftool/dbconfig/20240206-204115-marostegui.json [20:51:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T355609)', diff saved to 
https://phabricator.wikimedia.org/P56377 and previous config saved to /var/cache/conftool/dbconfig/20240206-205101-marostegui.json [20:51:09] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:59:34] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240206T2100). [21:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:57] Jdlrobson: if you're around, I can deploy your patches [21:02:36] (03PS1) 10Majavah: WebRequest: Fix default for backwards compat [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997876 (https://phabricator.wikimedia.org/T356800) [21:05:52] present cjming [21:06:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P56378 and previous config saved to /var/cache/conftool/dbconfig/20240206-210607-marostegui.json [21:07:08] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@916bff2]: (no justification provided) [21:07:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728) (owner: 10Jdlrobson) [21:07:37] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@916bff2]: (no justification provided) (duration: 00m 29s) [21:09:01] 10SRE, 10Traffic: Lower geodns TTLs from 600 (10min) to 300 (5min) - 
https://phabricator.wikimedia.org/T140365 (cmooney) Seems reasonable. There are some good reasons not to go too far (reducing load both our side and for recursive servers on the internet), but 5 mins seems ok to me. [21:18:24] (CR) Andrew Bogott: [C: +2] Removed refs to openstack version 'yoga' [puppet] - https://gerrit.wikimedia.org/r/997538 (owner: Andrew Bogott) [21:21:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P56379 and previous config saved to /var/cache/conftool/dbconfig/20240206-212114-marostegui.json [21:22:17] SRE, SRE-swift-storage, ops-codfw, Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (cmooney) Open→Resolved a:cmooney Closing task, all looks good following change. Big thanks to @Jhancock.... [21:22:25] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney) [21:28:41] (Merged) jenkins-bot: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997282 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson) [21:29:08] !log cjming@deploy2002 Started scap: Backport for [[gerrit:997282|Reduce font size of diff heading (T356728)]] [21:29:12] T356728: Regression: Font size increased on diff pages - https://phabricator.wikimedia.org/T356728 [21:30:37] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:997282|Reduce font size of diff heading (T356728)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:30:40] Jdlrobson: wanna test 1st patch? [21:30:43] yep [21:30:46] wmf16?
[21:30:49] yes [21:31:04] cjming: yep that did [21:31:04] it [21:31:06] please sync :) [21:31:11] will do [21:31:14] !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync [21:31:16] The other one you can also sync - no need to test as it's not live yet. [21:31:25] alrighty [21:31:40] Thank you :) [21:32:14] (CR) Clare Ming: [C: +2] Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997874 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson) [21:32:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:35:54] SRE, DNS, Foundational Technology Requests, Traffic, Patch-For-Review: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (jhathaway) a:jhathaway [21:36:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T355609)', diff saved to https://phabricator.wikimedia.org/P56380 and previous config saved to /var/cache/conftool/dbconfig/20240206-213621-marostegui.json [21:36:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [21:36:30] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [21:36:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [21:37:25] (SystemdUnitFailed) firing: (10) prometheus-phpfpm-statustext-textfile.service Failed on mw1384:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:37:45] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:997282|Reduce font size of diff heading (T356728)]] (duration: 08m 37s) [21:37:49] T356728: Regression: Font size increased on diff pages - https://phabricator.wikimedia.org/T356728 [21:38:26] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997874 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson) [21:39:27] SRE, DNS, Foundational Technology Requests, Traffic, Patch-For-Review: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (jhathaway) Open→Resolved @bcampbell I assume this is resolved, please go... [21:42:26] (SystemdUnitFailed) resolved: (34) prometheus-phpfpm-statustext-textfile.service Failed on mw1364:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:47:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:52:25] (Merged) jenkins-bot: Reduce font size of diff heading [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997874 (https://phabricator.wikimedia.org/T356728) (owner: Jdlrobson) [21:52:49] !log cjming@deploy2002 Started scap: Backport for [[gerrit:997874|Reduce font size of diff heading (T356728)]] [21:52:55] T356728: Regression: Font size increased on diff pages - https://phabricator.wikimedia.org/T356728 [21:54:16] (PS1) Andrew Bogott: OpenStack Designate: move from
cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) [21:54:17] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:997874|Reduce font size of diff heading (T356728)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:54:45] !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync [21:55:26] (CR) FNegri: [C: +1] "We should probably investigate what's broken, but I'm ok with merging if this fixes the issue. Please create a task to track this in Phab." [puppet] - https://gerrit.wikimedia.org/r/994250 (owner: Majavah) [21:56:44] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [21:57:41] (SystemdUnitFailed) firing: (33) prometheus-phpfpm-statustext-textfile.service Failed on mw1364:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:58:20] (CR) CI reject: [V: -1] OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [21:59:31] SRE, DNS, Foundational Technology Requests, Traffic, Patch-For-Review: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (bcampbell) @jhathaway Sorry for not closing the loop on this one. It is resolved...
[22:00:16] thank you cjming
[22:00:48] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:00:53] yw! wmf17 patch live soon
[22:01:14] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:997874|Reduce font size of diff heading (T356728)]] (duration: 08m 24s)
[22:01:30] T356728: Regression: Font size increased on diff pages - https://phabricator.wikimedia.org/T356728
[22:01:52] !log end of UTC late backport window
[22:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:04] (PS2) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995)
[22:02:40] (SystemdUnitFailed) firing: (32) prometheus-phpfpm-statustext-textfile.service Failed on mw1370:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:02:55] (SystemdUnitFailed) firing: (33) prometheus-phpfpm-statustext-textfile.service Failed on mw1370:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:06:09] (CR) CI reject: [V: -1] OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott)
[22:07:10] (PS3) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995)
[22:07:41] (SystemdUnitFailed) resolved: (33) prometheus-phpfpm-statustext-textfile.service Failed on mw1370:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:08:25] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott)
[22:14:14] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:17:41] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:27:05] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@11e5c60]: (no justification provided)
[22:27:33] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@11e5c60]: (no justification provided) (duration: 00m 28s)
[22:35:53] (PS1) Jforrester: Fix PermissionException being logged [extensions/Flow] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997877 (https://phabricator.wikimedia.org/T356223)
[22:36:08] (PS1) Jforrester: Fix PermissionException being logged [extensions/Flow] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997878 (https://phabricator.wikimedia.org/T356223)
[22:42:54] jouncebot nowandnext
[22:42:54] No deployments scheduled for the next 8 hour(s) and 17 minute(s)
[22:42:54] In 8 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0700)
[22:43:41] (PS1) Jforrester: Set the memory limit in bytes. [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997879 (https://phabricator.wikimedia.org/T356780)
[22:47:15] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997876 (https://phabricator.wikimedia.org/T356800) (owner: Majavah)
[23:08:41] (Merged) jenkins-bot: WebRequest: Fix default for backwards compat [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997876 (https://phabricator.wikimedia.org/T356800) (owner: Majavah)
[23:09:05] !log brennen@deploy2002 Started scap: Backport for [[gerrit:997876|WebRequest: Fix default for backwards compat (T356800)]]
[23:09:09] T356800: ArgumentCountError: Too few arguments to function MediaWiki\Request\WebRequest::getRequestPathSuffix() - https://phabricator.wikimedia.org/T356800
[23:10:36] !log brennen@deploy2002 taavi and brennen: Backport for [[gerrit:997876|WebRequest: Fix default for backwards compat (T356800)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:11:21] appears to fix officewiki image glitches.
[23:11:41] !log brennen@deploy2002 taavi and brennen: Continuing with sync
[23:12:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:17:25] (SystemdUnitFailed) firing: (7) prometheus-phpfpm-statustext-textfile.service Failed on mw1355:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:18:07] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:997876|WebRequest: Fix default for backwards compat (T356800)]] (duration: 09m 02s)
[23:18:11] T356800: ArgumentCountError: Too few arguments to function MediaWiki\Request\WebRequest::getRequestPathSuffix() - https://phabricator.wikimedia.org/T356800
[23:22:25] (SystemdUnitFailed) resolved: (36) prometheus-phpfpm-statustext-textfile.service Failed on mw1352:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:24:10] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@034ea4b]: (no justification provided)
[23:24:38] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@034ea4b]: (no justification provided) (duration: 00m 27s)