[00:13:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:38:25] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949214
[00:38:31] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949214 (owner: TrainBranchBot)
[00:40:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949215
[00:40:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949215 (owner: TrainBranchBot)
[00:42:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:32] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:34] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949214 (owner: TrainBranchBot)
[00:55:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (96) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:55:58] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949215 (owner: TrainBranchBot)
[01:27:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:44] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:28] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:32] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:44] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@ff0a21b]: (no justification provided)
[01:54:05] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@ff0a21b]: (no justification provided) (duration: 00m 20s)
[01:55:39] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@ff0a21b]: (no justification provided)
[01:55:58] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@ff0a21b]: (no justification provided) (duration: 00m 19s)
[01:58:45] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@ff0a21b]: (no justification provided)
[01:59:07] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@ff0a21b]: (no justification provided) (duration: 00m 22s)
[02:00:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:07] <jinxer-wm>	 (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:11:39] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:08] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1112 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:31:39] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:40] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1112 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:46:35] <wikibugs>	 (PS1) Andrew Bogott: wmsc-backup: correct ids passed for differential image backup [puppet] - https://gerrit.wikimedia.org/r/949610
[02:48:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:49:05] <wikibugs>	 (CR) CI reject: [V: -1] wmsc-backup: correct ids passed for differential image backup [puppet] - https://gerrit.wikimedia.org/r/949610 (owner: Andrew Bogott)
[02:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:52:40] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:33] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T344394 (phaultfinder)
[03:28:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:42] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:46] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1112 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:44:30] <wikibugs>	 (PS1) Stang: zhwiki: Create abusefilter-helper group [mediawiki-config] - https://gerrit.wikimedia.org/r/949612 (https://phabricator.wikimedia.org/T344398)
[03:44:32] <wikibugs>	 (PS1) Stang: zhwiki: Remove abusefilter-(log|view)-private from rollbacker [mediawiki-config] - https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398)
[03:56:46] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1112 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:13:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:27:33] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-08-14-091804-production [deployment-charts] - https://gerrit.wikimedia.org/r/949619 (https://phabricator.wikimedia.org/T336683)
[04:29:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:33:38] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (96) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:30:26] <wikibugs>	 SRE, PyBal, Scap, Traffic, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (Joe) Open→Resolved a:Joe The problem that caused this outage has been fixed.
[05:33:22] <wikibugs>	 SRE-OnFire, Incident Tooling, Patch-For-Review, User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (CodeReviewBot) oblivian merged https://gitlab.wikimedia.org/repos/sre/vopsbot/-/merge_requests/11  Allow users to be part of multiple teams
[05:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:48:07] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/948693 (https://phabricator.wikimedia.org/T343957) (owner: Cwhite)
[05:52:44] <wikibugs>	 SRE, Infrastructure-Foundations: Retire role::spare::system - https://phabricator.wikimedia.org/T324475 (MoritzMuehlenhoff)
[05:52:46] <wikibugs>	 SRE, Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (MoritzMuehlenhoff) Open→Declined The server will be decommissioned with the rest of old-esams via T343957
[05:54:54] <wikibugs>	 (PS1) Ayounsi: esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219)
[05:55:25] <wikibugs>	 (CR) CI reject: [V: -1] esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[05:56:03] <wikibugs>	 (PS2) Ayounsi: esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T0600). Please do the needful.
[06:01:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:04:08] <wikibugs>	 (CR) Stevemunene: datahub: Enable OIDC to idp_test (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[06:07:08] <jinxer-wm>	 (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:12:30] <wikibugs>	 (PS1) Dreamy Jazz: Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949573 (https://phabricator.wikimedia.org/T344403)
[06:13:15] <wikibugs>	 (PS1) Dreamy Jazz: Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949574 (https://phabricator.wikimedia.org/T344403)
[06:17:01] <wikibugs>	 (PS1) Ayounsi: old esams cleanup [homer/public] - https://gerrit.wikimedia.org/r/949623 (https://phabricator.wikimedia.org/T329219)
[06:25:05] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/949623 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[06:25:38] <wikibugs>	 (CR) Ayounsi: [C: +2] old esams cleanup [homer/public] - https://gerrit.wikimedia.org/r/949623 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[06:26:10] <wikibugs>	 (Merged) jenkins-bot: old esams cleanup [homer/public] - https://gerrit.wikimedia.org/r/949623 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[06:31:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:31:39] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job trafficserver-text in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:38:57] <wikibugs>	 (CR) Muehlenhoff: [C: +1] esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[06:39:51] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Make install3003 the new install server for esams [puppet] - https://gerrit.wikimedia.org/r/949552 (https://phabricator.wikimedia.org/T344355) (owner: Muehlenhoff)
[06:42:06] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[06:48:03] <wikibugs>	 (PS1) Muehlenhoff: Add new dummy keytab for install3003 [labs/private] - https://gerrit.wikimedia.org/r/949624
[06:49:26] <wikibugs>	 (PS2) Muehlenhoff: Add new dummy keytab for install3003 and remove install3002 [labs/private] - https://gerrit.wikimedia.org/r/949624
[06:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:59:04] <_joe_>	 !log updated vopsbot on the icinga hosts T344316
[06:59:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:08] <stashbot>	 T344316: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316
[07:00:06] <jouncebot>	 Amir1, apergos, and jnuche: (Dis)respected human, time to deploy UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T0700). Please do the needful.
[07:00:06] <jouncebot>	 aanzx: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:58] <apergos>	 morning!  there are no trainees signed up for today's morning backport window. I do see two patches, one a cherry pick from a patch not yet +2'ed, so I have concerns about that (unsure of irc handle?) and the other labelled as a wip and needing code review, so I have concerns about that too (aanzx)
[07:01:39] <aanzx>	 apergos: marked it as active 
[07:02:18] <wikibugs>	 (CR) JMeybohm: [C: +1] miscweb: add wikiworkshop and reasearch-landing-page to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/948998 (https://phabricator.wikimedia.org/T334511) (owner: Jelto)
[07:02:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T344405 (ppenloglou)
[07:03:15] <apergos>	 aanzx:  thanks.  we don't code review patches here initially, we just +2 them for deployment after a review as a general rule
[07:03:22] <apergos>	 so that still needs to be done for your config change
[07:03:31] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (ppenloglou)
[07:04:16] <apergos>	 if anyone knows the other patch owner (User:Dreamy Jazz) on irc, please speak up now, I don't know their nick here
[07:04:30] <kart_>	 Dreamy_Jazz: ^^
[07:04:49] <Dreamy_Jazz>	 \o
[07:04:50] <kart_>	 I just updated deployment patch (It had no name)
[07:04:57] <Dreamy_Jazz>	 Apologies. Didn't hear a ping.
[07:05:32] <wikibugs>	 (PS1) Muehlenhoff: Point the esams webproxy to install3003 [dns] - https://gerrit.wikimedia.org/r/949628 (https://phabricator.wikimedia.org/T344355)
[07:05:40] <apergos>	 Dreamy_Jazz:  your patch set for deployment is cherry picked from a patch without code review in master, so you need to get that sorted, or someone does, before deployment 
[07:06:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (ppenloglou) Hey @Marostegui ! Just an FYI I've made a [[ https://phabricator.wikimedia.org/T344405 | new request ]] here, relevant to what we did in this request.
[07:06:23] <Dreamy_Jazz>	 Was hoping to find zabe and/or taavi today at wikimania to get it merged, but didn't see them
[07:06:37] <Dreamy_Jazz>	 The issue is currently breaking Special:Investigate
[07:06:43] <wikibugs>	 (CR) Anzx: [C: +1] suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[07:06:54] <wikibugs>	 (CR) Ayounsi: [C: +2] esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[07:07:27] <wikibugs>	 (Merged) jenkins-bot: esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi)
[07:07:35] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Point the esams webproxy to install3003 [dns] - https://gerrit.wikimedia.org/r/949628 (https://phabricator.wikimedia.org/T344355) (owner: Muehlenhoff)
[07:08:11] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:08:54] <apergos>	 Dreamy_Jazz: it will have to wait, there is a later window if you can't find them yet
[07:09:11] <Dreamy_Jazz>	 I've just seen taavi. Will ask them now.
[07:09:39] <apergos>	 ok!
[07:10:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (Marostegui) Thanks - it will be processed by our [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty | clinic duty ]] assigned person this week :)
[07:10:40] <wikibugs>	 (CR) JMeybohm: aux: add tlsHostnames for jaeger collector and query (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: Filippo Giunchedi)
[07:11:17] <Dreamy_Jazz>	 taavi has +2'd the patch
[07:11:27] <Dreamy_Jazz>	 apergos: (for the above)
[07:11:34] <apergos>	 ok
[07:12:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on install3002.wikimedia.org with reason: decom in progress
[07:12:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on install3002.wikimedia.org with reason: decom in progress
[07:12:42] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=756bda9d-0fe5-407f-8e34-35d788d9ab8c) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and...
[07:13:04] <wikibugs>	 (CR) Minato826: [C: +1] suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[07:15:48] <taavi>	 apergos: are you deploying the backports or should I?
[07:16:05] <apergos>	 I was going to but feel free to do the one for Dreamy_Jazz if you like
[07:16:10] <apergos>	 taavi: 
[07:16:19] <taavi>	 will do, given they're sitting right next to me
[07:16:28] <apergos>	 oh :-D  sure then
[07:16:53] <wikibugs>	 (CR) Majavah: [C: +2] Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949574 (https://phabricator.wikimedia.org/T344403) (owner: Dreamy Jazz)
[07:16:55] <wikibugs>	 (CR) Majavah: [C: +2] Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949573 (https://phabricator.wikimedia.org/T344403) (owner: Dreamy Jazz)
[07:17:06] <apergos>	 just lemme know when you are done, I am looking at aanzx's patch now
[07:17:51] <taavi>	 or I can do that while the checkuser patches are still in CI
[07:18:05] <taavi>	 aanzx: still around?
[07:19:29] <wikibugs>	 (PS1) Majavah: Set WRITE_BOTH for OAuth multiple devices to checkuserwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949629 (https://phabricator.wikimedia.org/T242031)
[07:20:38] <apergos>	 taavi: 
[07:20:47] <apergos>	 let me dm you
[07:22:41] <apergos>	 so just to clarify for people following along
[07:22:44] <wikibugs>	 (CR) Jelto: [C: +2] miscweb: add wikiworkshop and reasearch-landing-page to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/948998 (https://phabricator.wikimedia.org/T334511) (owner: Jelto)
[07:23:05] <apergos>	 the job of a backport window runner is to get a patch out to production after it has gone through code review
[07:23:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install3003.wikimedia.org
[07:23:21] <apergos>	 while this might seem like a formality, it's there for a reason
[07:23:33] <wikibugs>	 (Merged) jenkins-bot: miscweb: add wikiworkshop and reasearch-landing-page to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/948998 (https://phabricator.wikimedia.org/T334511) (owner: Jelto)
[07:24:00] <koi>	 hi, is there still place for a config change?
[07:24:10] <apergos>	 and we do want the review to be meaningful, so self-review or review from someone who hasn't ever reviewed things, just to get the +1 on there, doesn't pass the bar.  this is not red tape for its own sake, we're just trying to ...
[07:24:27] <apergos>	 be good about having code vetted before it goes live.  thanks!
[07:24:34] <apergos>	 koi:  yes, let's see it!
[07:26:18] <koi>	 apergos, added to the calendar
[07:26:48] <RhinosF1>	 apergos: it's always good to be confident about what you're deploying
[07:27:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install3003.wikimedia.org
[07:27:51] <apergos>	 koi:  anybody that can +1 that?
[07:28:07] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[07:29:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:30:10] <RhinosF1>	 apergos: I can check and +1 in a few minutes
[07:30:20] <RhinosF1>	 I'm just reading the task. It's a permissions change.
[07:30:30] <apergos>	 I mean it's straightforward enough
[07:30:51] <RhinosF1>	 koi: consensus?
[07:31:01] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[07:31:03] <RhinosF1>	 apergos: I don't see any community discussion linked though.
[07:31:07] <koi>	 there's a link in the description
[07:31:47] <wikibugs>	 (CR) RhinosF1: [C: +1] "code wise fine, please add consensus link to task" [mediawiki-config] - https://gerrit.wikimedia.org/r/949612 (https://phabricator.wikimedia.org/T344398) (owner: Stang)
[07:31:58] <wikibugs>	 (Merged) jenkins-bot: Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949574 (https://phabricator.wikimedia.org/T344403) (owner: Dreamy Jazz)
[07:32:01] <wikibugs>	 (Merged) jenkins-bot: Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949573 (https://phabricator.wikimedia.org/T344403) (owner: Dreamy Jazz)
[07:32:02] <logmsgbot>	 !log gehel@cumin1001 conftool action : set/pooled=no; selector: name=cloudelastic1006.wikimedia.org
[07:32:03] <RhinosF1>	 Ah yes at bottom
[07:32:17] <koi>	 thanks a lot RhinosF1 
[07:32:20] <gehel>	 !log restarting elasticsearch on cloudelastic1006 (high GC)
[07:32:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:38] <RhinosF1>	 koi: that looks to exist
[07:32:42] <RhinosF1>	 So I'm happy
[07:32:44] <RhinosF1>	 apergos:
[07:33:00] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:949573|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]], [[gerrit:949574|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]]
[07:33:03] <stashbot>	 T344403: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to the local wiki, but it belongs to 'wikidatawiki' - https://phabricator.wikimedia.org/T344403
[07:34:43] <logmsgbot>	 !log taavi@deploy1002 dreamyjazz and taavi: Backport for [[gerrit:949573|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]], [[gerrit:949574|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:35:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install3002.wikimedia.org
[07:35:36] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[07:36:03] <logmsgbot>	 !log taavi@deploy1002 dreamyjazz and taavi: Continuing with sync
[07:36:03] <Dreamy_Jazz>	 Test complete.
[07:36:39] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job trafficserver-text in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:37:08] <jinxer-wm>	 (ProbeDown) resolved: (14) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:37:19] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[07:37:22] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fae8ef40280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:38:15] <gehel>	 ^ restart in progress (taking longer than expected), should resolve in a few minutes
[07:38:52] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 844, active_shards: 1396, relocating_shards: 0, initializing_shards: 30, unassigned_shards: 185, delayed_unassigned_shards: 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 8341, active_shards_percent_as_number: 86.6542520173805 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:39:15] <apergos>	 taavi:  are you done scapping things around? can I proceed with aanzx's patch?
[07:39:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:39:24] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:26] <logmsgbot>	 !log gehel@cumin1001 conftool action : set/pooled=yes; selector: name=cloudelastic1006.wikimedia.org
[07:39:32] <taavi>	 scap's still working on it, just a moment
[07:39:39] <apergos>	 ah, my bad, no worries
[07:40:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:40:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:43:31] <apergos>	 er not aanzx's patch, my bad, that would be koi's patch, coming up next (and last)
[07:44:26] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949573|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]], [[gerrit:949574|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]] (duration: 11m 25s)
[07:44:30] <stashbot>	 T344403: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to the local wiki, but it belongs to 'wikidatawiki' - https://phabricator.wikimedia.org/T344403
[07:45:29] <taavi>	 apergos: now done
[07:45:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:45:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:45:36] <apergos>	 sweet!
[07:45:46] <apergos>	 koi, proceeding with your patch now
[07:45:56] <koi>	 got it
[07:46:06] <wikibugs>	 (CR) ArielGlenn: [C: +2] zhwiki: Create abusefilter-helper group [mediawiki-config] - https://gerrit.wikimedia.org/r/949612 (https://phabricator.wikimedia.org/T344398) (owner: Stang)
[07:46:45] <wikibugs>	 (Merged) jenkins-bot: zhwiki: Create abusefilter-helper group [mediawiki-config] - https://gerrit.wikimedia.org/r/949612 (https://phabricator.wikimedia.org/T344398) (owner: Stang)
[07:47:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:47:47] <logmsgbot>	 !log ariel@deploy1002 Started scap: Backport for [[gerrit:949612|zhwiki: Create abusefilter-helper group (T344398)]]
[07:47:50] <stashbot>	 T344398: Create abusefilter helper group on zhwiki - https://phabricator.wikimedia.org/T344398
[07:48:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:48:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:48:11] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts install3002.wikimedia.org
[07:48:23] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install3002.wikimedia.org` - install3002.wikimedia.org (**FAIL**)   -...
[07:49:37] <logmsgbot>	 !log ariel@deploy1002 stang and ariel: Backport for [[gerrit:949612|zhwiki: Create abusefilter-helper group (T344398)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:49:52] <apergos>	 koi:  please test your change on mwdebug1002
[07:49:57] <koi>	 looking
[07:49:58] <wikibugs>	 (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[07:51:22] <koi>	 apergos, i tested in https://zh.wikipedia.org/wiki/Special:Listgrouprights and it looks good
[07:51:38] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:40] <apergos>	 proceeding
[07:51:44] <logmsgbot>	 !log ariel@deploy1002 stang and ariel: Continuing with sync
[07:53:43] <wikibugs>	 (PS5) Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361)
[07:57:02] <apergos>	 watching php-fpm restarts is the new zuul-watching
[07:57:31] <wikibugs>	 (PS6) Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361)
[07:57:40] <wikibugs>	 (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[07:58:10] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:58:11] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:59:06] <logmsgbot>	 !log ariel@deploy1002 Finished scap: Backport for [[gerrit:949612|zhwiki: Create abusefilter-helper group (T344398)]] (duration: 11m 18s)
[07:59:10] <stashbot>	 T344398: Create abusefilter helper group on zhwiki - https://phabricator.wikimedia.org/T344398
[07:59:22] <apergos>	 koi:  your change is live in production, please test it there
[07:59:35] <wikibugs>	 (PS7) Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361)
[08:00:05] <koi>	 apergos, it works well, thanks!
[08:00:19] <apergos>	 great! and with that, today's backport window comes to a close
[08:01:39] <apergos>	 !log UTC morning backport and config window done
[08:01:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:46] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:03:13] <wikibugs>	 (PS8) Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361)
[08:03:44] <wikibugs>	 (PS9) Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361)
[08:04:09] <wikibugs>	 (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[08:04:41] * kart_ updating cxserver 
[08:05:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Acked by Valentín on IRC; I'll go ahead and deploy." [puppet] - https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355) (owner: Ssingh)
[08:05:52] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-08-14-091804-production [deployment-charts] - https://gerrit.wikimedia.org/r/949619 (https://phabricator.wikimedia.org/T336683) (owner: KartikMistry)
[08:06:41] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-08-14-091804-production [deployment-charts] - https://gerrit.wikimedia.org/r/949619 (https://phabricator.wikimedia.org/T336683) (owner: KartikMistry)
[08:07:15] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (Clement_Goubert) In progress→Resolved
[08:07:24] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[08:07:46] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[08:08:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir3002.esams.wmnet
[08:11:22] <wikibugs>	 (CR) Clément Goubert: [C: +2] admin: add fab to deployment group [puppet] - https://gerrit.wikimedia.org/r/948693 (https://phabricator.wikimedia.org/T343957) (owner: Cwhite)
[08:13:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:13:30] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (Clement_Goubert) In progress→Resolved a:Clement_Goubert Patch merged, the access should be deployed by puppet in the next half-hour. Boldly resolving, feel...
[08:14:03] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[08:14:36] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[08:14:47] <jinxer-wm>	 (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:15:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:16:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:16:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:16:21] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ncredir3002.esams.wmnet
[08:16:33] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir3002.esams.wmnet` - ncredir3002.esams.wmnet (**FAIL**)   - Dow...
[08:16:52] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[08:16:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (Clement_Goubert) @Mabualruz The out of band verification of your SSH public key is still required as well.
[08:17:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir3001.esams.wmnet
[08:17:24] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[08:18:30] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (MoritzMuehlenhoff)
[08:19:41] <wikibugs>	 (CR) Jbond: [C: +2] release: add additional instructions [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/949527 (owner: Jbond)
[08:21:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to  releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (Clement_Goubert) @lojo_wmde I can see L3 has been signed, however we still need your public SSH key, both here on the ticket and on your wikitech user page for out-of-band verific...
[08:21:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:23:35] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus3002.esams.wmnet
[08:23:44] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[08:24:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3001.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:24:47] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[08:25:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3001.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:25:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:25:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir3001.esams.wmnet
[08:25:45] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir3001.esams.wmnet` - ncredir3001.esams.wmnet (**PASS**)   - Dow...
[08:28:00] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.dns.netbox
[08:29:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti3001.esams.wmnet
[08:29:18] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:29:19] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts prometheus3002.esams.wmnet
[08:29:30] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: `prometheus3002.esams.wmnet` - prometheus3002.esams.wmnet (**FAIL*...
[08:31:22] <wikibugs>	 (PS1) Muehlenhoff: Remove Puppet references for ganeti3001-3003 [puppet] - https://gerrit.wikimedia.org/r/949836 (https://phabricator.wikimedia.org/T344363)
[08:31:49] <kart_>	 !log Updated cxserver to 2023-08-14-091804-production (T336683, T343211)
[08:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:55] <stashbot>	 T336683: Enable MinT support for languages with no Wikipedia yet - https://phabricator.wikimedia.org/T336683
[08:31:55] <stashbot>	 T343211: Enable Content and Section translation on 12 Wikipedias - https://phabricator.wikimedia.org/T343211
[08:31:59] <kart_>	 (Sorry, forgot to log ^^ earlier)
[08:33:00] <wikibugs>	 SRE-OnFire, Incident Tooling, User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (Joe) Open→Resolved
[08:33:06] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (Clement_Goubert)
[08:33:15] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (fgiunchedi)
[08:34:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:38:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3001.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:40:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3001.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:40:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:40:59] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti3001.esams.wmnet
[08:41:07] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:09] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus3003.esams.wmnet
[08:41:10] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.dns.netbox
[08:41:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti3002.esams.wmnet
[08:42:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove Puppet references for ganeti3001-3003 [puppet] - https://gerrit.wikimedia.org/r/949836 (https://phabricator.wikimedia.org/T344363) (owner: Muehlenhoff)
[08:43:12] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3003.esams.wmnet - filippo@cumin1001"
[08:43:37] <wikibugs>	 (PS1) Filippo Giunchedi: Out with prometheus3002, in with prometheus3003 [puppet] - https://gerrit.wikimedia.org/r/949837 (https://phabricator.wikimedia.org/T344355)
[08:43:56] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3003.esams.wmnet - filippo@cumin1001"
[08:43:57] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:43:57] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3003.esams.wmnet on all recursors
[08:44:00] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3003.esams.wmnet on all recursors
[08:44:20] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3003.esams.wmnet - filippo@cumin1001"
[08:44:25] <wikibugs>	 (CR) Clément Goubert: [C: +1] httpd-fcgi: fix double logging issue [docker-images/production-images] - https://gerrit.wikimedia.org/r/947829 (https://phabricator.wikimedia.org/T340935) (owner: Giuseppe Lavagetto)
[08:44:45] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:22] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3003.esams.wmnet - filippo@cumin1001"
[08:45:36] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host prometheus3003.esams.wmnet with OS bullseye
[08:45:41] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (MoritzMuehlenhoff)
[08:45:51] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host prometheus3003.esams.wmnet with OS bullseye
[08:47:24] <wikibugs>	 (PS1) Filippo Giunchedi: wmnet: use prometheus3003 in esams [dns] - https://gerrit.wikimedia.org/r/949838 (https://phabricator.wikimedia.org/T344355)
[08:47:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:49:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:51:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:51:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:51:01] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti3002.esams.wmnet
[08:51:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti3003.esams.wmnet
[08:53:40] <wikibugs>	 (CR) Clément Goubert: "Questions inline" [docker-images/production-images] - https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) (owner: Giuseppe Lavagetto)
[08:54:09] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:44] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (Clement_Goubert) SSH key confirmed through https://wikitech.wikimedia.org/wiki/User:Panagiotis_Penloglou No group membership necessary as per T335353 To be completely...
[08:55:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:55:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (Clement_Goubert)
[08:55:59] <wikibugs>	 (PS1) Clément Goubert: admin: New ssh key for ppenloglou [puppet] - https://gerrit.wikimedia.org/r/949839 (https://phabricator.wikimedia.org/T344405)
[08:58:25] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:54] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (Clement_Goubert) Open→In progress
[09:00:22] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/949837 (https://phabricator.wikimedia.org/T344355) (owner: Filippo Giunchedi)
[09:00:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (Clement_Goubert) Open→In progress
[09:01:29] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (Clement_Goubert) Open→In progress
[09:01:40] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (Clement_Goubert) Open→In progress
[09:01:53] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [dns] - https://gerrit.wikimedia.org/r/949838 (https://phabricator.wikimedia.org/T344355) (owner: Filippo Giunchedi)
[09:02:22] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Out with prometheus3002, in with prometheus3003 [puppet] - https://gerrit.wikimedia.org/r/949837 (https://phabricator.wikimedia.org/T344355) (owner: Filippo Giunchedi)
[09:02:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:03:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:03:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:03:50] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti3003.esams.wmnet
[09:04:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] wmnet: use prometheus3003 in esams [dns] - https://gerrit.wikimedia.org/r/949838 (https://phabricator.wikimedia.org/T344355) (owner: Filippo Giunchedi)
[09:09:56] <wikibugs>	 (PS1) Muehlenhoff: netbox: Disable ganeti sync for old esams cluster [puppet] - https://gerrit.wikimedia.org/r/949841
[09:10:22] <wikibugs>	 (PS8) Jbond: role::puppetserver: Add config master [puppet] - https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717)
[09:10:50] <wikibugs>	 (Abandoned) Jbond: role::puppetserver: Add config master [puppet] - https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[09:11:07] <wikibugs>	 (CR) Jbond: [C: +2] puppetmaster: stop creating the volatile/misc folder [puppet] - https://gerrit.wikimedia.org/r/949554 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[09:11:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] netbox: Disable ganeti sync for old esams cluster [puppet] - https://gerrit.wikimedia.org/r/949841 (owner: Muehlenhoff)
[09:12:31] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:58] <wikibugs>	 (CR) Muehlenhoff: [C: +2] netbox: Disable ganeti sync for old esams cluster [puppet] - https://gerrit.wikimedia.org/r/949841 (owner: Muehlenhoff)
[09:13:58] <wikibugs>	 (PS1) Jelto: trafficserver: switch wikiworkshop.org and research.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/949842 (https://phabricator.wikimedia.org/T334511)
[09:19:11] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus3003.esams.wmnet with reason: host reimage
[09:19:53] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:20:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[09:22:32] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus3003.esams.wmnet with reason: host reimage
[09:22:41] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for ganeti01.svc.esams.wmnet - cmooney@cumin1001"
[09:22:42] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[09:23:31] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for ganeti01.svc.esams.wmnet - cmooney@cumin1001"
[09:23:31] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:23:50] <wikibugs>	 (PS1) Jbond: P:config-master: proixy_sha1 variable needs to be added to vhost_settings [puppet] - https://gerrit.wikimedia.org/r/949844 (https://phabricator.wikimedia.org/T341717)
[09:24:11] <wikibugs>	 (PS1) Jelto: miscweb: add www.wikiworkshop.org to extraFQDNs [deployment-charts] - https://gerrit.wikimedia.org/r/949845 (https://phabricator.wikimedia.org/T334511)
[09:24:28] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[09:24:42] <wikibugs>	 (PS2) Jbond: P:config-master: proixy_sha1 variable needs to be added to vhost_settings [puppet] - https://gerrit.wikimedia.org/r/949844 (https://phabricator.wikimedia.org/T341717)
[09:26:27] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42906/console" [puppet] - https://gerrit.wikimedia.org/r/949844 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[09:26:58] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for ganeti01.svc.esams.wmnet - cmooney@cumin1001"
[09:27:43] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for ganeti01.svc.esams.wmnet - cmooney@cumin1001"
[09:27:43] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:28:50] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (gmodena) That’s a false positive, we don’t have active traffic in codfw. There was WIP to fix it before I went on PTO but I guess i...
[09:28:57] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1106.eqiad.wmnet with OS bullseye
[09:29:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1107.eqiad.wmnet with OS bullseye
[09:30:45] <logmsgbot>	 !log btullis@deploy1002 Started deploy [airflow-dags/analytics@ff0a21b]: (no justification provided)
[09:31:08] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@ff0a21b]: (no justification provided) (duration: 00m 22s)
[09:32:07] <jinxer-wm>	 (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:32:59] <effie>	 looking
[09:34:49] <effie>	 _joe_: shall we wait a bit on this alert? it is on esams
[09:35:15] <effie>	 !log temporarily pooling kartotherian on codfw 
[09:35:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:04] <godog>	 yes I thought site=esams had been silenced, it wasn't
[09:36:09] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] P:config-master: proixy_sha1 variable needs to be added to vhost_settings [puppet] - https://gerrit.wikimedia.org/r/949844 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[09:36:16] <godog>	 effie _joe_ that's the prometheus host coming online, I'll silence
[09:36:32] <effie>	 cool godog tx 
[09:36:36] <jinxer-wm>	 (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:36:53] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus3003.esams.wmnet with OS bullseye
[09:36:53] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus3003.esams.wmnet
[09:36:55] <_joe_>	 effie: I still didn't get the pages...
[09:37:05] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host prometheus3003.esams.wmnet with OS bullseye completed: - promethe...
[09:37:34] <_joe_>	 sorry if I didn't react earlier
[09:37:42] <effie>	 _joe_: it has not been 5' yet, so
[09:37:54] <_joe_>	 effie: well you were paged right?
[09:38:02] <_joe_>	 as in the page was delivered to you
[09:39:16] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: add trafficserver-side mangling to rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/949846 (https://phabricator.wikimedia.org/T344358)
[09:39:45] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
[09:40:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Move config-master to dedicated VMs - https://phabricator.wikimedia.org/T341717 (10jbond)
[09:40:26] <_joe_>	 oh wait, this did not send a page to victorops
[09:40:34] <wikibugs>	 (03PS1) 10Jbond: config-master: add proxy modules to httpd [puppet] - 10https://gerrit.wikimedia.org/r/949847 (https://phabricator.wikimedia.org/T341717)
[09:40:38] <_joe_>	 !incidents
[09:40:39] <sirenbot>	 3951 (ACKED)  [12x] ProbeDown sre (probes/service esams)
[09:40:51] <_joe_>	 yep that is indeed old
[09:42:13] <effie>	 _joe_: no, I saw irc 
[09:42:32] <_joe_>	 oh, ok. 
[09:42:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] config-master: add proxy modules to httpd [puppet] - 10https://gerrit.wikimedia.org/r/949847 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond)
[09:43:19] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1106.eqiad.wmnet with reason: host reimage
[09:44:04] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] p:tlsproxy::envoy: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro)
[09:46:21] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1106.eqiad.wmnet with reason: host reimage
[09:46:52] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42910/console" [puppet] - 10https://gerrit.wikimedia.org/r/949847 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond)
[09:48:29] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42911/console" [puppet] - 10https://gerrit.wikimedia.org/r/949847 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond)
[09:49:03] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10ppenloglou) Thank you @Clement_Goubert ! I'll let Danny know but he's OOO this week, so we can pick this up next week ;)
[09:51:42] <wikibugs>	 (03PS1) 10Urbanecm: [beta] Growth: Enable user research opt-in checkbox on few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949849 (https://phabricator.wikimedia.org/T342353)
[09:55:22] <wikibugs>	 (03PS1) 10Effie Mouzeli: tegola-vector-tiles: update application image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949852 (https://phabricator.wikimedia.org/T344324)
[09:55:37] <wikibugs>	 (03PS1) 10Ayounsi: Homer: remove all mentions of old esams [homer/public] - 10https://gerrit.wikimedia.org/r/949853 (https://phabricator.wikimedia.org/T329219)
[09:57:41] <wikibugs>	 (03PS1) 10David Caro: toolsdb: add skipped table to the config [puppet] - 10https://gerrit.wikimedia.org/r/949854 (https://phabricator.wikimedia.org/T344411)
[09:58:33] <wikibugs>	 (03PS1) 10Jbond: configmaster: pass puppet_ca_server via vhost settings [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717)
[09:58:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] configmaster: pass puppet_ca_server via vhost settings [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond)
[10:00:05] <jouncebot>	 mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1000).
[10:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1000)
[10:00:27] <godog>	 effie _joe_ I'll resolve incident 3951 since it is acked only and will re-page if left alone
[10:00:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42913/console" [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond)
[10:00:38] <effie>	 godog: <3
[10:00:45] <_joe_>	 godog: yeah I assumed it would be solved today
[10:01:11] <godog>	 yeah defo not
[10:01:44] <wikibugs>	 (03PS1) 10Urbanecm: revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949576 (https://phabricator.wikimedia.org/T316079)
[10:02:12] <wikibugs>	 (03PS1) 10Urbanecm: revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949577 (https://phabricator.wikimedia.org/T316079)
[10:02:53] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm)
[10:04:08] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "The cassandra script looks ok, but take this with a grain of salt" [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan)
[10:05:43] <wikibugs>	 (03PS5) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033)
[10:07:07] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: update application image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949852 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli)
[10:07:15] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] trafficserver: switch wikiworkshop.org and research.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/949842 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto)
[10:07:18] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli)
[10:07:42] <wikibugs>	 (03PS2) 10Jbond: configmaster: pass puppet_ca_server via vhost settings [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717)
[10:07:50] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: update application image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949852 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli)
[10:08:02] <wikibugs>	 (03Merged) 10jenkins-bot: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli)
[10:08:27] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[10:09:11] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1107.eqiad.wmnet with OS bullseye
[10:09:59] <moritzm>	 !log installing ghostscript security updates
[10:10:01] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[10:10:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1106.eqiad.wmnet with OS bullseye
[10:10:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1107.eqiad.wmnet with OS bullseye
[10:12:24] <wikibugs>	 (03PS1) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219)
[10:13:07] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[10:13:38] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[10:13:43] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: python3 compatibility [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949932
[10:13:45] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Convert to python3, bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949933
[10:13:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[10:14:08] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Convert to python3, bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949933
[10:14:20] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: python3 compatibility [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949932 (owner: 10Giuseppe Lavagetto)
[10:15:32] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[10:16:05] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[10:16:55] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10Fabfur)
[10:17:23] <wikibugs>	 (03PS1) 10Ayounsi: Remove all mentions of old-esams, replace with new esams [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219)
[10:19:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949933 (owner: 10Giuseppe Lavagetto)
[10:20:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Convert to python3, bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949933 (owner: 10Giuseppe Lavagetto)
[10:21:41] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[10:22:20] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[10:22:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3007.esams.wmnet to cluster esams01 and group BY27
[10:23:04] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3007.esams.wmnet to cluster esams01 and group BY27
[10:23:11] <effie>	 !log depool kartotherian (maps) codfw - T344324
[10:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:14] <stashbot>	 T344324: Maps Unavailability (14 Aug 2023) - https://phabricator.wikimedia.org/T344324
[10:23:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:23:23] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[10:23:27] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1107.eqiad.wmnet with reason: host reimage
[10:23:34] <wikibugs>	 (03CR) 10Ayounsi: "Running PCC on all the fleet: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42915/" [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[10:26:35] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1107.eqiad.wmnet with reason: host reimage
[10:28:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:28:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Rules LGTM, I'm adding Keith since AFAICS the sli/slo rules live in modules/profile/files/thanos/recording_rules.yaml (i.e. they are globa" [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[10:28:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Reinstate ganeti Netbox sync for esams01 [puppet] - 10https://gerrit.wikimedia.org/r/949935
[10:29:10] <_joe_>	 uhm high errors on parsoid
[10:30:49] <effie>	 so far I only see timeouts 
[10:31:08] <_joe_>	 yep
[10:31:39] <_joe_>	 and OOMs
[10:32:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: aux: add tlsHostnames for jaeger collector and query (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi)
[10:32:35] <wikibugs>	 (03PS2) 10Filippo Giunchedi: aux: add tlsHostnames for jaeger collector and query [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253)
[10:33:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:34:45] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/949853 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[10:35:13] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[10:35:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Reinstate ganeti Netbox sync for esams01 [puppet] - 10https://gerrit.wikimedia.org/r/949935 (owner: 10Muehlenhoff)
[10:36:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Homer: remove all mentions of old esams [homer/public] - 10https://gerrit.wikimedia.org/r/949853 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[10:36:43] <wikibugs>	 (03Merged) 10jenkins-bot: Homer: remove all mentions of old esams [homer/public] - 10https://gerrit.wikimedia.org/r/949853 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[10:37:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Reinstate ganeti Netbox sync for esams01 [puppet] - 10https://gerrit.wikimedia.org/r/949935 (owner: 10Muehlenhoff)
[10:44:28] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan)
[10:45:10] <wikibugs>	 (03PS1) 10Ssingh: 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219)
[10:46:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[10:47:45] <wikibugs>	 (03PS2) 10Ssingh: 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219)
[10:49:18] <sukhe>	 authdns-update is currently broken, fixing it ^
[10:49:25] <sukhe>	 just as an FYI
[10:49:41] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1107.eqiad.wmnet with OS bullseye
[10:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:53:29] <wikibugs>	 (03CR) 10Ayounsi: 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[10:53:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3007.esams.wmnet to cluster esams01 and group BY27
[10:53:42] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[10:53:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[10:54:13] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3007.esams.wmnet to cluster esams01 and group BY27
[10:54:18] <sukhe>	 !log run authdns-update for CR 949938
[10:54:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removing DNS names for ae1-103.cr2-esams and vrrp-gw-103 - sukhe@cumin2002"
[10:56:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removing DNS names for ae1-103.cr2-esams and vrrp-gw-103 - sukhe@cumin2002"
[10:56:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:02:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Add ncredir300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949941 (https://phabricator.wikimedia.org/T344355)
[11:03:09] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Add ncredir300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949941 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[11:04:41] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Adapt the control file to bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949942
[11:04:51] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm)
[11:04:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Adapt the control file to bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949942 (owner: 10Giuseppe Lavagetto)
[11:04:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3003.esams.wmnet
[11:05:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:05:11] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: httpd-fcgi: fix double logging issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947829 (https://phabricator.wikimedia.org/T340935)
[11:05:13] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935)
[11:05:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd-fcgi: fix double logging issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947829 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto)
[11:05:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto)
[11:06:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3003.esams.wmnet - jmm@cumin2002"
[11:07:09] <wikibugs>	 (03PS2) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219)
[11:07:12] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm)
[11:07:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3003.esams.wmnet - jmm@cumin2002"
[11:07:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:07:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3003.esams.wmnet on all recursors
[11:07:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3003.esams.wmnet on all recursors
[11:07:51] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935)
[11:08:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[11:08:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3003.esams.wmnet - jmm@cumin2002"
[11:08:53] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto)
[11:08:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3003.esams.wmnet - jmm@cumin2002"
[11:09:12] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd-fcgi: de-quote unicode characters in logs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto)
[11:09:26] <wikibugs>	 (03PS3) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219)
[11:10:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[11:10:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3003.esams.wmnet with OS bullseye
[11:10:59] <wikibugs>	 (03PS4) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219)
[11:11:04] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:11:32] <_joe_>	 jouncebot: nowandnext
[11:11:32] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 48 minute(s)
[11:11:32] <jouncebot>	 In 0 hour(s) and 48 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1200)
[11:11:43] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:12:05] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:12:42] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:12:51] <wikibugs>	 (03PS1) 10MVernon: hiera: add swift user search_update_pipeline [puppet] - 10https://gerrit.wikimedia.org/r/949943 (https://phabricator.wikimedia.org/T342620)
[11:12:55] <wikibugs>	 (03PS1) 10MVernon: hiera: add fake credential for swift user search_update_pipeline [labs/private] - 10https://gerrit.wikimedia.org/r/949944 (https://phabricator.wikimedia.org/T342620)
[11:13:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:06] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:15:07] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/949587 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans)
[11:16:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:16:28] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:16:38] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:57] <wikibugs>	 (03PS1) 10Hnowlan: aqs: enable geo_analytics user [puppet] - 10https://gerrit.wikimedia.org/r/949947 (https://phabricator.wikimedia.org/T336400)
[11:17:28] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10SDeckelmann-WMF) I approve.
[11:19:53] <wikibugs>	 (03PS5) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219)
[11:21:06] <wikibugs>	 (03CR) 10Ssingh: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[11:21:55] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:22:09] <wikibugs>	 (03CR) 10Btullis: "Removing my +1 because I've just thought of something that's going to make it fail." [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[11:22:28] <wikibugs>	 (03PS4) 10Clément Goubert: Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm)
[11:23:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:59] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Clement_Goubert)
[11:25:20] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm)
[11:27:06] <wikibugs>	 (03CR) 10Ssingh: Remove all mentions of old-esams, replace with new esams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[11:27:12] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:27:14] <wikibugs>	 (03CR) 10Jbond: "follow up comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/941758 (owner: 10Volans)
[11:27:43] <wikibugs>	 (03Merged) 10jenkins-bot: Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm)
[11:29:12] <wikibugs>	 (03CR) 10Ssingh: Use only active authdns hosts for DNS changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/941758 (owner: 10Volans)
[11:29:37] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:29:44] <wikibugs>	 (03PS2) 10Ayounsi: Remove all mentions of old-esams, replace with new esams [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219)
[11:29:53] <wikibugs>	 (03CR) 10Ayounsi: Remove all mentions of old-esams, replace with new esams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[11:31:28] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:31:39] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:32:18] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:32:30] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:33:58] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:34:11] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:35:00] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:36:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:38:10] <wikibugs>	 (03CR) 10Jelto: "I'll postpone deployment of the config change to the next maintenance which requires restart/reboot." [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto)
[11:39:25] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:53] <wikibugs>	 (03PS1) 10Clément Goubert: mediawiki: Set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/949957 (https://phabricator.wikimedia.org/T342748)
[11:47:04] <wikibugs>	 (03PS1) 10Jbond: check_puppetrun: update to use failed_resources Puppet::Transaction::Report [puppet] - 10https://gerrit.wikimedia.org/r/949959 (https://phabricator.wikimedia.org/T337951)
[11:49:15] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM! thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/949959 (https://phabricator.wikimedia.org/T337951) (owner: 10Jbond)
[11:49:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] configmaster: pass puppet_ca_server via vhost settings [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond)
[11:55:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] check_puppetrun: update to use failed_resources Puppet::Transaction::Report [puppet] - 10https://gerrit.wikimedia.org/r/949959 (https://phabricator.wikimedia.org/T337951) (owner: 10Jbond)
[11:59:30] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] mediawiki: Set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/949957 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1200)
[12:03:33] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncredir3003.esams.wmnet with OS bullseye
[12:03:33] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir3003.esams.wmnet
[12:04:39] <jelto>	 !log restart jwt-authorizer service (docker-registry-ha-jwt.service) on registry nodes - T337474
[12:04:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:42] <stashbot>	 T337474: Replace deprecated `CI_JOB_JWT` CI variable in Kokkuri - https://phabricator.wikimedia.org/T337474
[12:07:14] <wikibugs>	 (03PS1) 10Muehlenhoff: ncredir: Use globbing to select the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/949962
[12:09:55] <wikibugs>	 (03PS3) 10Acamicamacaraca: Enable VisualEditor in Draft and Project namespace on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432)
[12:14:47] <jinxer-wm>	 (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:16:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:18] <wikibugs>	 (03PS4) 10Acamicamacaraca: Enable VisualEditor in Project and Draft namespaces on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432)
[12:20:17] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] knams migration: remove references to old esams (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[12:20:48] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] miscweb: add www.wikiworkshop.org to extraFQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/949845 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto)
[12:21:12] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb: add www.wikiworkshop.org to extraFQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/949845 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto)
[12:22:05] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: add www.wikiworkshop.org to extraFQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/949845 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto)
[12:22:59] <icinga-wm>	 PROBLEM - config-master.wikimedia.org requires authentication on config-master1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:24:27] <icinga-wm>	 RECOVERY - config-master.wikimedia.org requires authentication on config-master1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:25:10] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[12:25:28] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[12:26:05] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[12:26:21] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[12:26:29] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:26:57] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[12:27:07] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[12:27:46] <wikibugs>	 (03PS1) 10Urbanecm: cross-wiki userrights: Add SpecialUserRights::getDisplayUsername [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949582 (https://phabricator.wikimedia.org/T344391)
[12:28:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:32:23] <wikibugs>	 (03CR) 10Ayounsi: "Full PCC output: https://puppet-compiler.wmflabs.org/output/949934/42915/" [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[12:32:41] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:15] <wikibugs>	 (03PS1) 10Jbond: config-master: enable ssl for proxies [puppet] - 10https://gerrit.wikimedia.org/r/949969 (https://phabricator.wikimedia.org/T341717)
[12:34:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] check_puppetrun: update to use failed_resources Puppet::Transaction::Report [puppet] - 10https://gerrit.wikimedia.org/r/949959 (https://phabricator.wikimedia.org/T337951) (owner: 10Jbond)
[12:36:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] config-master: enable ssl for proxies [puppet] - 10https://gerrit.wikimedia.org/r/949969 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond)
[12:40:06] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] releases jenkins: allow Scap to disable services on secondary hosts [puppet] - 10https://gerrit.wikimedia.org/r/947814 (https://phabricator.wikimedia.org/T343447) (owner: 10Jaime Nuche)
[12:41:06] <eoghan>	 jbond: Happy for me to merge your change? 
[12:45:28] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] ncredir: Use globbing to select the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/949962 (owner: 10Muehlenhoff)
[12:46:36] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[12:49:58] <wikibugs>	 (03CR) 10Stevemunene: datahub: Enable OIDC to idp_test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[12:55:57] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:58:03] <wikibugs>	 (03CR) 10Herron: "Effie and I are planning to deploy this in codfw shortly and monitor tegola closely before making a go/no-go decision for eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron)
[12:59:25] <wikibugs>	 (03PS1) 10Ssingh: 10.in-addr.arpa: remove include for netbox/0.21.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949972 (https://phabricator.wikimedia.org/T329219)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1300). nyaa~
[13:00:05] <jouncebot>	 Urbanecm, aanzx, and Aca: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:15] * Aca waves
[13:00:16] <urbanecm>	 i can deploy today
[13:00:21] <aanzx>	 o/
[13:00:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] 10.in-addr.arpa: remove include for netbox/0.21.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949972 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:00:32] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949577 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm)
[13:00:38] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949576 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm)
[13:00:42] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] cross-wiki userrights: Add SpecialUserRights::getDisplayUsername [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949582 (https://phabricator.wikimedia.org/T344391) (owner: 10Urbanecm)
[13:01:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ncredir: Use globbing to select the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/949962 (owner: 10Muehlenhoff)
[13:01:02] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable VisualEditor in Project and Draft namespaces on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) (owner: 10Acamicamacaraca)
[13:01:10] <wikibugs>	 (03PS2) 10Muehlenhoff: ncredir: Use globbing to select the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/949962
[13:01:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) (owner: 10Acamicamacaraca)
[13:01:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add ncredir300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949941 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[13:02:03] <wikibugs>	 (03Merged) 10jenkins-bot: Enable VisualEditor in Project and Draft namespaces on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) (owner: 10Acamicamacaraca)
[13:04:33] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949581|Enable VisualEditor in Project and Draft namespaces on shwiki (T344432)]]
[13:04:40] <stashbot>	 T344432: Enable VisualEditor in Project and Draft namespaces on shwiki - https://phabricator.wikimedia.org/T344432
[13:04:56] <Aca>	 checking now
[13:05:29] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and aleksandar: Backport for [[gerrit:949581|Enable VisualEditor in Project and Draft namespaces on shwiki (T344432)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:05:47] <urbanecm>	 Aca: it's only available on the debug servers now, but please go ahead :)
[13:06:07] <Aca>	 yeah, that's what I was referring to
[13:08:18] <urbanecm>	 k8s build phase of scap resulted in an error. it says non-k8s deployment will proceed, but any idea how to fix that? https://www.irccloud.com/pastebin/cGx7BdzE/
[13:08:18] <Aca>	 After refreshing, VisualEditor tab is now shown in the toolbar.
[13:08:26] <Aca>	 lgtm
[13:08:35] <urbanecm>	 Aca: great. waiting now on the k8s failure error.
[13:09:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3003.esams.wmnet
[13:09:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:10:07] <urbanecm>	 seems it happened while building `webserver-image-base` (while running `docker build --pull --build-arg "http_proxy=http://webproxy.eqiad.wmnet:8080" --build-arg "https_proxy=http://webproxy.eqiad.wmnet:8080" -f Dockerfile.webserver-base-image -t webserver-image-base .  `)
[13:10:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:10:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3003.esams.wmnet on all recursors
[13:10:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3003.esams.wmnet on all recursors
[13:10:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3003.esams.wmnet - jmm@cumin2002"
[13:11:13] <wikibugs>	 (03CR) 10Gehel: "I spot checked the changes between the new systemd unit and the processes running on wdqs1003 (both main and categories) and on wcqs1001. " [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[13:11:40] <wikibugs>	 (03PS10) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361)
[13:12:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3003.esams.wmnet - jmm@cumin2002"
[13:12:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3003.esams.wmnet with OS bullseye
[13:15:32] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and aleksandar: Continuing with sync
[13:15:32] <wikibugs>	 (03PS1) 10Ssingh: Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219)
[13:15:50] <urbanecm>	 proceeding, the k8s bits seem to be auto-disabled by scap. filing a task.
[13:16:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Removing myself as this is on hold for now" [software/librenms] - 10https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) (owner: 10Andrea Denisse)
[13:16:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:17:49] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:19:09] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:19:50] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949581|Enable VisualEditor in Project and Draft namespaces on shwiki (T344432)]] (duration: 15m 16s)
[13:19:54] <stashbot>	 T344432: Enable VisualEditor in Project and Draft namespaces on shwiki - https://phabricator.wikimedia.org/T344432
[13:20:39] <urbanecm>	 filed the k8s bug as T344438.
[13:20:40] <stashbot>	 T344438: scap backport fails to build a image for k8s deployment - https://phabricator.wikimedia.org/T344438
[13:20:40] <wikibugs>	 (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949577 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm)
[13:21:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:21:10] <wikibugs>	 (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949576 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm)
[13:21:13] <wikibugs>	 (03Merged) 10jenkins-bot: cross-wiki userrights: Add SpecialUserRights::getDisplayUsername [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949582 (https://phabricator.wikimedia.org/T344391) (owner: 10Urbanecm)
[13:22:01] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "Good to go for testing on codfw" [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron)
[13:22:21] <urbanecm>	 continuing with backports, as the core one fixes a UBN, which is better to have fixed at least for the non-k8s world.
[13:23:02] <wikibugs>	 (03CR) 10Herron: [C: 03+2] thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron)
[13:23:05] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949582|cross-wiki userrights: Add SpecialUserRights::getDisplayUsername (T344391 T255309)]], [[gerrit:949577|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]], [[gerrit:949576|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]]
[13:23:11] <stashbot>	 T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079
[13:23:12] <stashbot>	 T255309: Remove UserRightsProxy and replace its usages with UserGroupManager - https://phabricator.wikimedia.org/T255309
[13:23:12] <stashbot>	 T344391: Interwiki user rights changes not being logged correctly - https://phabricator.wikimedia.org/T344391
[13:23:47] <wikibugs>	 (03PS2) 10Ayounsi: Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:23:55] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:949582|cross-wiki userrights: Add SpecialUserRights::getDisplayUsername (T344391 T255309)]], [[gerrit:949577|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]], [[gerrit:949576|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:24:31] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Continuing with sync
[13:24:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:26:16] <claime>	 moritzm: Did anything change regarding gpg signing on the apt repo?
[13:26:27] <wikibugs>	 (03PS3) 10Ayounsi: Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:26:36] <claime>	 moritzm: context https://phabricator.wikimedia.org/T344438
[13:27:24] <urbanecm>	 claime: fwiw, a very similar error happened as T338952 recently. dunno how related those two actually are, but just in case... :)
[13:27:25] <stashbot>	 T338952: mwaddlink fails to build because of a missing public key - https://phabricator.wikimedia.org/T338952
[13:27:27] <claime>	 ah, just saw your comment urbanecm, I'll check something
[13:27:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage
[13:27:39] <urbanecm>	 sounds good :)
[13:28:19] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:28:52] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949582|cross-wiki userrights: Add SpecialUserRights::getDisplayUsername (T344391 T255309)]], [[gerrit:949577|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]], [[gerrit:949576|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]] (duration: 05m 46s)
[13:28:58] <stashbot>	 T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079
[13:28:58] <stashbot>	 T255309: Remove UserRightsProxy and replace its usages with UserGroupManager - https://phabricator.wikimedia.org/T255309
[13:28:58] <stashbot>	 T344391: Interwiki user rights changes not being logged correctly - https://phabricator.wikimedia.org/T344391
[13:29:05] * urbanecm finished in-progress deployments now.
[13:29:23] <urbanecm>	 waiting for now, as i don't want the k8s and non-k8s worlds to diverge even more.
[13:29:31] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:30:21] <_joe_>	 urbanecm: can you wait a sec please?
[13:30:41] <urbanecm>	 _joe_: yeah, i'm waiting. 
[13:31:10] <_joe_>	 sigh another incident on esams
[13:31:19] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[13:31:22] <_joe_>	 XioNoX / topranks cr1-esams paging again
[13:31:55] <sukhe>	 !incidents
[13:31:55] <sirenbot>	 3951 (RESOLVED)  [12x] ProbeDown sre (probes/service esams)
[13:32:02] <topranks>	 _joe_: hmm that’s a new box, maybe added to monitor off prematurely
[13:32:10] <XioNoX>	 they're both downtimed in icinga
[13:32:15] <_joe_>	 says "snooze expired"
[13:32:21] <_joe_>	 in victorops
[13:32:23] <XioNoX>	 _joe_: what's the page I don't see it here
[13:32:24] <XioNoX>	 ahhh
[13:32:25] <XioNoX>	 ok
[13:32:25] <_joe_>	 and now resolved
[13:32:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage
[13:32:36] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:33:01] <arnoldokoth>	 _joe_: Resolved it manually.
[13:34:47] <arnoldokoth>	 I assumed it would go away after it was down-timed but I guess not.
[13:35:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[13:35:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:puppetserver: add support for extra_mounts [puppet] - 10https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[13:36:01] <effie>	 !log pooling kartotherian (maps) on codfw - T344324
[13:36:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetserver: add volatile file mount [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[13:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:05] <stashbot>	 T344324: Maps Unavailability (14 Aug 2023) - https://phabricator.wikimedia.org/T344324
[13:36:22] <wikibugs>	 (03CR) 10Esanders: "There is an ongoing discussion on the task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) (owner: 10Acamicamacaraca)
[13:36:37] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
[13:37:14] <wikibugs>	 (03PS5) 10Jbond: puppetserver: add volatile file mount [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056)
[13:38:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetserver: add volatile file mount (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[13:40:26] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:40:34] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[13:40:42] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:41:02] <wikibugs>	 (03PS6) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219)
[13:42:09] <wikibugs>	 (03CR) 10Ssingh: "rebased to exclude changes already in master, purging of /24 and /48 PTRs" [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:42:42] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:42:50] <effie>	 the maps error is being handled
[13:44:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:36] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:44:46] <wikibugs>	 (03PS2) 10Andrew Bogott: wmsc-backup: correct ids passed for differential image backup [puppet] - 10https://gerrit.wikimedia.org/r/949610
[13:44:46] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[13:45:08] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:45:17] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[13:45:38] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:46:02] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:46:12] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:46:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3003.esams.wmnet with OS bullseye
[13:46:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir3003.esams.wmnet
[13:46:18] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:47:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3004.esams.wmnet
[13:47:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:47:25] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] trafficserver: update config-master to use discovery record [puppet] - 10https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond)
[13:47:42] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:43] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff)
[13:48:58] <wikibugs>	 (03CR) 10JMeybohm: aux: add tlsHostnames for jaeger collector and query (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi)
[13:49:10] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[13:50:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmsc-backup: correct ids passed for differential image backup [puppet] - 10https://gerrit.wikimedia.org/r/949610 (owner: 10Andrew Bogott)
[13:50:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3004.esams.wmnet - jmm@cumin2002"
[13:53:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Add durum300[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355)
[13:54:17] <wikibugs>	 (03CR) 10JMeybohm: "I did implement what I proposed in https://phabricator.wikimedia.org/T277876#9095795 - we can adapt the calculations if we feel reservatio" [puppet] - 10https://gerrit.wikimedia.org/r/949843 (https://phabricator.wikimedia.org/T277876) (owner: 10JMeybohm)
[13:54:49] <wikibugs>	 (03PS3) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056)
[13:54:51] <wikibugs>	 (03PS1) 10Jbond: puppetserver: switch to useing ca_server instead of enable_ca [puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056)
[13:55:15] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "TLS material looking good on eqiad & codfw deployments:" [puppet] - 10https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond)
[13:55:32] <wikibugs>	 (03CR) 10Ssingh: Add durum300[34] to site.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[13:55:36] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:55:44] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:55:49] <wikibugs>	 (03CR) 10Bking: [C: 03+1] hiera: add swift user search_update_pipeline [puppet] - 10https://gerrit.wikimedia.org/r/949943 (https://phabricator.wikimedia.org/T342620) (owner: 10MVernon)
[13:55:50] <icinga-wm>	 PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:56:22] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:56:51] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42919/console" [puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[13:56:52] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:56:56] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] tests: fix CertificateState tests on python 3.10+ [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949505 (https://phabricator.wikimedia.org/T344330) (owner: 10Vgutierrez)
[13:57:38] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Re-update artificially images to overcome a docker-pkg bug [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/949979 (https://phabricator.wikimedia.org/T344438)
[13:57:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum3003.esams.wmnet
[13:57:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:57:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[13:57:58] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42920/console" [puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[13:58:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3004.esams.wmnet - jmm@cumin2002"
[13:58:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:58:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3004.esams.wmnet on all recursors
[13:58:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3004.esams.wmnet on all recursors
[13:58:15] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Re-update artificially images to overcome a docker-pkg bug [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/949979 (https://phabricator.wikimedia.org/T344438) (owner: 10Giuseppe Lavagetto)
[13:58:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3004.esams.wmnet - jmm@cumin2002"
[13:59:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42921/console" [puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[13:59:13] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Re-update artificially images to overcome a docker-pkg bug [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/949979 (https://phabricator.wikimedia.org/T344438)
[13:59:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3004.esams.wmnet - jmm@cumin2002"
[13:59:25] <inflatador>	 !log bking@cumin1001 'disabling puppet on wcqs/wdqs to test 949503'
[13:59:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:28] <wikibugs>	 (03PS2) 10Muehlenhoff: Add durum300[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355)
[13:59:43] <wikibugs>	 (03CR) 10Muehlenhoff: Add durum300[34] to site.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[13:59:53] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Re-update artificially images to overcome a docker-pkg bug [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/949979 (https://phabricator.wikimedia.org/T344438) (owner: 10Giuseppe Lavagetto)
[14:00:06] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: switch to useing ca_server instead of enable_ca [puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[14:00:30] <effie>	 any maps alerts will clear soon, sorry for the noise
[14:00:55] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Add durum300[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[14:01:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add durum300[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[14:01:34] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:01:49] <wikibugs>	 (03PS2) 10Vgutierrez: Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[14:02:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[14:03:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3004.esams.wmnet with OS bullseye
[14:03:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3003.esams.wmnet - jmm@cumin2002"
[14:04:08] <_joe_>	 urbanecm: I'll re-deploy to k8s
[14:04:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3003.esams.wmnet - jmm@cumin2002"
[14:04:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:04:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum3003.esams.wmnet on all recursors
[14:04:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum3003.esams.wmnet on all recursors
[14:04:39] <urbanecm>	 ack, ty. i have some patches to finish, too. 
[14:04:43] <urbanecm>	 ping me once ready for me :)
[14:04:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3003.esams.wmnet - jmm@cumin2002"
[14:05:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3003.esams.wmnet - jmm@cumin2002"
[14:05:53] <logmsgbot>	 !log oblivian@deploy1002 Started scap: (no justification provided)
[14:06:13] <_joe_>	 urbanecm: it will take some time, I'm rebuilding the images from scratch
[14:06:24] <urbanecm>	 ok, noted.
[14:06:34] * urbanecm is having fun with our CI in the meantime.
[14:06:40] <logmsgbot>	 !log oblivian@deploy1002 sync-world aborted: (no justification provided) (duration: 00m 46s)
[14:08:28] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.744 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:09:06] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 9.879 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:09:08] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:09:24] <icinga-wm>	 RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.664 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:10:40] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:10:50] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:11:40] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:45] <logmsgbot>	 !log oblivian@deploy1002 Started scap: (no justification provided)
[14:12:12] <wikibugs>	 (03PS1) 10Jbond: puppetserver: dont auto restart puppet server [puppet] - 10https://gerrit.wikimedia.org/r/949980 (https://phabricator.wikimedia.org/T330490)
[14:12:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetserver: dont auto restart puppet server [puppet] - 10https://gerrit.wikimedia.org/r/949980 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[14:14:29] <wikibugs>	 10SRE, 10Acme-chief, 10Traffic: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez)
[14:16:30] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10MediaWiki-Platform-Team (Radar), and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Krinkle)
[14:16:39] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage
[14:17:57] <wikibugs>	 (03PS1) 10Ayounsi: cr1-esams: add transit an LACP min links [homer/public] - 10https://gerrit.wikimedia.org/r/949981
[14:18:14] <wikibugs>	 (03CR) 10Vgutierrez: "looking good, please see inline comments." [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[14:19:16] <wikibugs>	 (03PS3) 10Ayounsi: Remove all mentions of old-esams, replace with new esams [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219)
[14:19:47] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] cr1-esams: add transit an LACP min links [homer/public] - 10https://gerrit.wikimedia.org/r/949981 (owner: 10Ayounsi)
[14:19:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum3003.esams.wmnet with OS bullseye
[14:20:18] <wikibugs>	 (03Merged) 10jenkins-bot: cr1-esams: add transit an LACP min links [homer/public] - 10https://gerrit.wikimedia.org/r/949981 (owner: 10Ayounsi)
[14:20:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage
[14:21:05] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:21:25] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[14:21:43] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2007.codfw.wmnet with reason: canary for T342361
[14:21:43] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:46] <stashbot>	 T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361
[14:21:56] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2007.codfw.wmnet with reason: canary for T342361
[14:24:30] <wikibugs>	 (03PS4) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056)
[14:27:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[14:29:20] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[14:29:22] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[14:29:56] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[14:33:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3004.esams.wmnet with OS bullseye
[14:34:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir3004.esams.wmnet
[14:34:28] <wikibugs>	 (03PS1) 10Clément Goubert: Revert "Remove limits in ResourceQuota and container limitanges for mediawiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/949583
[14:36:39] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[14:36:41] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3003.esams.wmnet with reason: host reimage
[14:37:48] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: (no justification provided) (duration: 26m 03s)
[14:39:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Looks good! Thanks. I think whatever PCC failures we have are unrelated but definitely could use another pair of eyes on this." [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[14:41:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: host reimage
[14:41:09] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Revert "Remove limits in ResourceQuota and container limitanges for mediawiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/949583 (owner: 10Clément Goubert)
[14:43:31] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Remove limits in ResourceQuota and container limitanges for mediawiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/949583 (owner: 10Clément Goubert)
[14:44:08] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[14:44:36] <claime>	 !log Rolling back 949583 for T344438
[14:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:40] <stashbot>	 T344438: scap backport fails to build a image for k8s deployment - https://phabricator.wikimedia.org/T344438
[14:45:47] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[14:45:57] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:46:22] <wikibugs>	 10SRE, 10Lingua Libre, 10Traffic: Network issue between LinguaLibre and Wikimedia Commons - https://phabricator.wikimedia.org/T344421 (10Yug) Given the bug is persisting and preventing loggin, we may want to use the sitenotice to gently announce a pause in contributions / log in.
[14:46:27] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:46:47] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:47:23] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2002.codfw.wmnet with OS bookworm
[14:47:35] <wikibugs>	 (03PS1) 10Btullis: Failover hive to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/949984 (https://phabricator.wikimedia.org/T303168)
[14:47:37] <wikibugs>	 (03PS1) 10Btullis: Fail back hive to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/949985 (https://phabricator.wikimedia.org/T303168)
[14:47:43] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff)
[14:48:01] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:48:19] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:48:49] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Failover hive to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/949984 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis)
[14:48:50] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:49:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum3004.esams.wmnet
[14:49:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:49:09] <claime>	 !log Re-deploying mw-on-k8s T344438
[14:49:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:49:24] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:49:25] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:49:27] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:49:28] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[14:49:40] <wikibugs>	 10SRE, 10Lingua Libre, 10Traffic: Network issue between LinguaLibre and Wikimedia Commons - https://phabricator.wikimedia.org/T344421 (10Joe) It would be useful if you could follow https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue to give us a bit more details to go by.
[14:50:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Add doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949987 (https://phabricator.wikimedia.org/T344355)
[14:51:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[14:51:21] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[14:51:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3004.esams.wmnet - jmm@cumin2002"
[14:51:34] <wikibugs>	 10SRE, 10Lingua Libre, 10Traffic: Network issue between LinguaLibre and Wikimedia Commons - https://phabricator.wikimedia.org/T344421 (10ayounsi) Hi, see https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue for more troubleshooting commands, but to start with could you provide the output of:  `...
[14:51:44] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Add doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949987 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[14:52:26] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[14:52:27] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[14:54:00] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[14:54:01] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[14:54:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3003.esams.wmnet with OS bullseye
[14:54:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum3003.esams.wmnet
[14:55:21] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[14:55:22] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[14:56:50] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[14:56:51] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[14:57:24] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[14:57:25] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[14:57:28] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[14:57:30] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[14:57:32] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[14:58:01] <wikibugs>	 10SRE, 10MediaWiki-Core-Revision-backend, 10Performance-Team (Radar): Compress data at external storage - https://phabricator.wikimedia.org/T106386 (10Krinkle)
[14:58:07] <claime>	 urbanecm: ok, fixed and redeployed
[14:58:17] <urbanecm>	 claime: thanks for the fix!
[14:58:17] <claime>	 bare metal and k8s should be sync'd now
[14:58:26] <claime>	 urbanecm: mostly j.oe tbh
[14:58:32] <urbanecm>	 thanks to both :)
[14:59:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3004.esams.wmnet - jmm@cumin2002"
[14:59:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:59:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum3004.esams.wmnet on all recursors
[14:59:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum3004.esams.wmnet on all recursors
[14:59:41] <urbanecm>	 jouncebot: nowandnext
[14:59:41] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 0 minute(s)
[14:59:41] <jouncebot>	 In 1 hour(s) and 0 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1600)
[14:59:48] <urbanecm>	 aanzx: are you still here for your patch?
[14:59:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3004.esams.wmnet - jmm@cumin2002"
[15:00:44] <aanzx>	 urbanecm: yes
[15:00:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3004.esams.wmnet - jmm@cumin2002"
[15:01:04] <wikibugs>	 (03PS1) 10Urbanecm: Growth: Temporarily disable link-recommendation FE on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949988 (https://phabricator.wikimedia.org/T316079)
[15:01:10] <urbanecm>	 okay, let's go ahead.
[15:01:14] <wikibugs>	 (03PS3) 10Urbanecm: suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx)
[15:01:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949987 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[15:01:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx)
[15:01:28] <wikibugs>	 (03PS2) 10Urbanecm: Growth: Temporarily disable link-recommendation FE on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949988 (https://phabricator.wikimedia.org/T316079)
[15:01:31] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Growth: Temporarily disable link-recommendation FE on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949988 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm)
[15:02:01] <wikibugs>	 (03Merged) 10jenkins-bot: suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx)
[15:02:15] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: Temporarily disable link-recommendation FE on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949988 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm)
[15:02:47] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949568|suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace (T344314)]], [[gerrit:949988|Growth: Temporarily disable link-recommendation FE on arwiki (T316079)]]
[15:02:52] <stashbot>	 T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314
[15:02:52] <stashbot>	 T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079
[15:03:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3003.wikimedia.org
[15:03:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[15:03:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum3004.esams.wmnet with OS bullseye
[15:04:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:40] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:949568|suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace (T344314)]], [[gerrit:949988|Growth: Temporarily disable link-recommendation FE on arwiki (T316079)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[15:04:48] <urbanecm>	 aanzx: please test.
[15:05:01] <aanzx>	 urbanecm:ok
[15:05:41] <wikibugs>	 (03CR) 10Bking: [C: 03+1] hiera: add fake credential for swift user search_update_pipeline [labs/private] - 10https://gerrit.wikimedia.org/r/949944 (https://phabricator.wikimedia.org/T342620) (owner: 10MVernon)
[15:07:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3003.wikimedia.org - jmm@cumin2002"
[15:07:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3003.wikimedia.org - jmm@cumin2002"
[15:07:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:07:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh3003.wikimedia.org on all recursors
[15:07:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3003.wikimedia.org on all recursors
[15:08:19] <wikibugs>	 10SRE-swift-storage, 10observability, 10EngProd-Virtual-Hackathon: Add FileBackend statsd metrics and a dashboard - https://phabricator.wikimedia.org/T217754 (10Krinkle)
[15:08:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3003.wikimedia.org - jmm@cumin2002"
[15:08:31] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:08:57] <urbanecm>	 aanzx: how is it looking please?
[15:09:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3003.wikimedia.org - jmm@cumin2002"
[15:09:32] <aanzx>	 urbanecm: looks good 
[15:09:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh3003.wikimedia.org with OS bullseye
[15:10:01] <urbanecm>	 ack, syncing.
[15:10:03] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and anzx: Continuing with sync
[15:11:13] <aanzx>	 urbanecm: i don't know why this patch is giving CI error https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/949570 can you take a look
[15:11:52] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Fail back hive to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/949985 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis)
[15:11:56] <urbanecm>	 aanzx: i rebased that patch, let's see if it happens again.
[15:13:08] <aanzx>	 Ok
[15:16:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:17:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3004.esams.wmnet with reason: host reimage
[15:17:43] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949568|suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace (T344314)]], [[gerrit:949988|Growth: Temporarily disable link-recommendation FE on arwiki (T316079)]] (duration: 14m 56s)
[15:17:48] <stashbot>	 T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314
[15:17:48] <stashbot>	 T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079
[15:17:51] <urbanecm>	 aanzx: should be live.
[15:18:04] <aanzx>	 urbanecm: thanks 
[15:18:08] <urbanecm>	 np
[15:20:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3004.esams.wmnet with reason: host reimage
[15:21:19] <jnuche>	 urbanecm: are you finished with the backports?
[15:21:33] <urbanecm>	 jnuche: for now, yes :). thanks
[15:21:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:22:21] <jnuche>	 urbanecm: thx, I'm going to update scap
[15:22:56] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.58.0" for 597 hosts
[15:24:29] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert)
[15:25:01] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert) Revalidation by manager waived since it's a ctr to req conversion.  SSH key double-checked on authenticated out of band. Patch i...
[15:25:06] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] admin: New ssh key for ppenloglou [puppet] - 10https://gerrit.wikimedia.org/r/949839 (https://phabricator.wikimedia.org/T344405) (owner: 10Clément Goubert)
[15:26:01] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.58.0" for 596 hosts
[15:26:11] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] admin: New ssh key for ppenloglou [puppet] - 10https://gerrit.wikimedia.org/r/949839 (https://phabricator.wikimedia.org/T344405) (owner: 10Clément Goubert)
[15:26:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10mpopov)
[15:26:57] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.58.0" completed for 596 hosts
[15:27:26] <aanzx>	 urbanecm: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/949570 after rebase it worked , thanks 
[15:27:40] <urbanecm>	 no worries.
[15:28:10] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert) 05In progress→03Resolved a:03Clement_Goubert Patch merged, access should be updated after half an hour once puppet has run...
[15:29:13] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10mpopov) @BTullis: once Omari is added to the analytics private data users group do you prefer he ping you here or on {T328457}...
[15:31:15] <taavi>	 urbanecm: is T344446 related to T344391 or are those separate bugs?
[15:31:16] <stashbot>	 T344446: Notification received from metawiki instead of target site when group modified on metawiki - https://phabricator.wikimedia.org/T344446
[15:31:16] <stashbot>	 T344391: Interwiki user rights changes not being logged correctly - https://phabricator.wikimedia.org/T344391
[15:31:56] <urbanecm>	 taavi: my assumption would be that they're one and the same bug, but let me try how notifications work now.
[15:32:37] <urbanecm>	 taavi: nope, it's a separate bug.
[15:32:59] * urbanecm is noting that on task.
[15:34:25] * urbanecm wishes he had a magic wand to command T342763 completed
[15:34:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3004.esams.wmnet with OS bullseye
[15:34:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum3004.esams.wmnet
[15:34:41] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@2d3a0b7] (releasing): (no justification provided)
[15:35:24] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@2d3a0b7] (releasing): (no justification provided) (duration: 00m 43s)
[15:35:35] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:37:01] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff)
[15:37:38] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff)
[15:38:17] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10Clement_Goubert)
[15:41:23] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:42:22] <wikibugs>	 (03PS1) 10Urbanecm: Revert "Growth: Temporarily disable link-recommendation FE on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949585 (https://phabricator.wikimedia.org/T316079)
[15:42:38] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10Clement_Goubert) a:03OSefu-WMF @mpopov I'll be handling the access request as well as kerberos principal, but @BTullis will...
[15:46:23] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:47:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[15:49:32] <wikibugs>	 10SRE-swift-storage, 10Data-Persistence, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (10Gehel)
[15:50:26] <sukhe>	 !log sukhe@alert1001:~$ sudo systemctl reload icinga.service 
[15:50:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:46] <wikibugs>	 (03PS3) 10BCornwall: Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154)
[15:51:51] <wikibugs>	 (03CR) 10BCornwall: Update dependencies to match Bookworm versions (032 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[15:52:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[15:52:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[15:52:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[15:54:35] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[15:55:18] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10BTullis) Welcome @OSefu-WMF ! I'd be very grateful if you could do me a small favour please.   >>! In T344257#9099732, @Clemen...
[15:55:54] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:57:10] <wikibugs>	 (03PS1) 10Ssingh: site.pp: use correct hostname for doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949993
[15:58:52] <wikibugs>	 (03PS2) 10Ssingh: site.pp: use correct hostname for doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949993
[15:59:44] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10mpopov) @Clement_Goubert: Thank you! For my own future reference and @OSefu-WMF's clarification – do you mean adding the publi...
[16:00:04] <jouncebot>	 jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:14] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host doh3003.wikimedia.org with OS bullseye
[16:00:14] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host doh3003.wikimedia.org
[16:01:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix doh300[34] entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949994
[16:02:14] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Fix doh300[34] entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949994 (owner: 10Muehlenhoff)
[16:02:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/949993 (owner: 10Ssingh)
[16:02:30] <sukhe>	 moritzm: merge any and I will abandon the other :)
[16:02:37] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2002.codfw.wmnet with OS bookworm
[16:02:39] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Fix doh300[34] entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949994 (owner: 10Muehlenhoff)
[16:02:54] <moritzm>	 sukhe: go ahead with a merge, I just abandoned mine :-)
[16:03:05] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[16:03:32] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] site.pp: use correct hostname for doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949993 (owner: 10Ssingh)
[16:09:55] <wikibugs>	 (03PS1) 10Herron: Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/950007
[16:11:23] <XioNoX>	 !log merging Puppet change 949934 - Remove all mentions of old-esams, replace with new esams
[16:11:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:26] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove all mentions of old-esams, replace with new esams [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[16:12:01] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:12:11] <icinga-wm>	 RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 461, down: 18, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:13:49] <wikibugs>	 (03PS2) 10Herron: Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/950007 (https://phabricator.wikimedia.org/T343987)
[16:14:14] <sukhe>	 !log force agent run on A:lvs and A:esams
[16:14:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:48] <jinxer-wm>	 (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:14:52] <wikibugs>	 (03CR) 10Herron: [C: 03+2] Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/950007 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron)
[16:16:09] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:09] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:17:13] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:17:18] <sukhe>	 ulsfo? 
[16:18:09] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10colewhite)
[16:18:50] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10colewhite) Confirmed Omari's ssh key via Slack DM. `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIP3p3IF96m0/MLPgxWxgEbo6QyGZEMc8fj6bn3...
[16:19:01] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:19:39] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (10KFrancis) The NDA has been signed.  Please proceed with next steps.  Thank you!
[16:20:15] <XioNoX>	 sukhe: 198.35.26.7           64605          9          7       0      16        2:56 Establ
[16:20:21] <XioNoX>	 so bgp bounced 3min ago
[16:20:30] <sukhe>	 dns4003
[16:20:33] <sukhe>	 ok
[16:21:19] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[16:21:28] <sukhe>	 looking
[16:21:42] <sukhe>	 Error: 'asw2-esams' is not a valid parent for host 'ganeti3001' (file '/etc/icinga/objects/puppet_hosts.cfg', line 17459)!
[16:21:45] <sukhe>	 Error: 'asw2-esams' is not a valid parent for host 'ganeti3002' (file '/etc/icinga/objects/puppet_hosts.cfg', line 17476)!
[16:21:48] <sukhe>	 Error: 'asw2-esams' is not a valid parent for host 'ganeti3003' (file '/etc/icinga/objects/puppet_hosts.cfg', line 17493)!
[16:22:07] <XioNoX>	 er, ganeti3001/2/3 are not decom yet?
[16:22:40] <sukhe>	 yep
[16:22:41] <XioNoX>	 oh, that's what you were chatting about before with moritzm ?
[16:22:41] <sukhe>	 decommed
[16:22:45] <sukhe>	 no, that was doh
[16:23:07] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:23:21] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:23:29] <sukhe>	 probably a Puppet run race condition then, let's check
[16:24:19] <sukhe>	 https://puppetboard.wikimedia.org/report/alert1001.wikimedia.org/87b4e07ef5d462d28975f160ac7f1fcaeb48c9d5
[16:24:23] <sukhe>	 removals here
[16:24:23] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:13] <XioNoX>	 sukhe: is there a way to explore the puppet resources to know if there is a "stuck" ganeti one?
[16:26:17] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:53] <sukhe>	 XioNoX: checking
[16:27:00] <sukhe>	 I think the ferm failures above are also somewhat related
[16:27:04] <sukhe>	 as in from this change
[16:27:25] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:27:36] <XioNoX>	 sukhe: I was looking at ferm, it says "Aug 17 16:19:55 ml-serve2005 ferm[1340493]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?"
[16:28:27] <sukhe>	 XioNoX: fixed the mlserve one
[16:28:30] <XioNoX>	 just doing a "ml-serve2005:~$ sudo service ferm start" seems to have solved it
[16:28:33] <XioNoX>	 eh
[16:28:33] <sukhe>	 yeah
[16:28:37] <XioNoX>	 sukhe: what did you do?
[16:28:50] <sukhe>	 ran agent again, we have been seeing some puppet race conditions with ferm in other hosts
[16:28:56] <XioNoX>	 I see
[16:29:03] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:08] <sukhe>	 XioNoX:         host_name                      ganeti3001
[16:29:08] <sukhe>	         hostgroups                     ganeti_esams,asw2-esams
[16:29:14] <sukhe>	 just not seeing where it is coming from though
[16:29:17] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 6.730 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:29:25] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:29:41] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:29:57] <XioNoX>	 sukhe: fixed ferm on kubernetes2007
[16:30:03] <sukhe>	 thanks
[16:30:46] <XioNoX>	 jbond, you're still around?
[16:30:55] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:47] <sukhe>	         address                        10.20.0.31
[16:34:00] <sukhe>	 didn't we purge 10.20.0.0/23 from everywhere?
[16:34:07] <sukhe>	 so this is related
[16:36:05] <XioNoX>	 er, I did an icinga restart instead of reload
[16:36:10] <XioNoX>	 I hope I didn't break it
[16:36:34] <sukhe>	 there might be more alerts 
[16:36:44] <sukhe>	 from external monitoring, but otherwise should be fine
[16:37:14] <wikibugs>	 (03CR) 10Btullis: datahub: Enable OIDC to idp_test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[16:37:24] <XioNoX>	 sukhe: I'll remove them manually from the config and see if puppet tries to add them back
[16:37:35] <XioNoX>	 I did the same yesterday for something else
[16:38:09] <sukhe>	 yeah that's one option, though my concern is that if it's there, it might be a symptom of the older config settings being left in some other place
[16:38:12] <sukhe>	 but yeah go for it
[16:38:48] <XioNoX>	 sukhe: it's up now
[16:38:51] <XioNoX>	 running puppet
[16:38:55] <sukhe>	 cool :)
[16:39:17] <sukhe>	 do it on alert2001 as well then just to be sure
[16:40:11] <XioNoX>	 waiting for puppet to finish
[16:40:41] <XioNoX>	 sukhe: wow
[16:40:49] <wikibugs>	 (03CR) 10Btullis: datahub: Enable OIDC to idp_test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[16:40:50] <XioNoX>	 it added so many checks
[16:40:51] * sukhe holds breath
[16:41:06] <sukhe>	 alert1001?
[16:41:54] <sukhe>	 er, it added back what you changed basically 
[16:42:02] <XioNoX>	 so it removed a bunch of ganeti3001/2/3 checks
[16:42:08] <XioNoX>	 like service checks
[16:42:27] <sukhe>	 yeah that's quite a lot of them
[16:42:44] <sukhe>	 + host_name ganeti3002
[16:42:53] <XioNoX>	 and it re-added them?
[16:42:57] <sukhe>	 yep
[16:43:00] <sukhe>	 same error is back
[16:43:08] <sukhe>	 Error: 'asw2-esams' is not a valid parent for host 'ganeti3003' (file '/etc/icinga/objects/puppet_hosts.cfg', line 17493)!
[16:43:13] <sukhe>	 let's look again
[16:43:36] <XioNoX>	 so we can re-add a dummy asw2-esams
[16:43:42] <XioNoX>	 as workaround
[16:44:16] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[16:44:18] <sukhe>	 by dummy you mean just in Puppet but not in netbox? 
[16:44:25] <XioNoX>	 sukhe: yeah
[16:44:29] <XioNoX>	 cwhite, anyone from o11y can help us out with icinga?
[16:44:46] * cwhite looking
[16:45:04] <XioNoX>	 cwhite: tl;dr; ganeti3001/2/3 are decom, but puppet still adds them to /etc/icinga/objects/puppet_hosts.cfg
[16:45:28] <sukhe>	 and it did remove them previously but not from this file I guess
[16:45:31] <XioNoX>	 which breaks icinga as the switch (parent) they depend on is gone
[16:46:07] <cwhite>	 ok
[16:46:24] <sukhe>	 cwhite: https://puppetboard.wikimedia.org/report/alert1001.wikimedia.org/87b4e07ef5d462d28975f160ac7f1fcaeb48c9d5
[16:46:27] <sukhe>	 they were removed here
[16:46:36] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:46:51] <XioNoX>	 the parent, yeah, but dunno for the ganeti hosts
[16:47:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk2002.codfw.wmnet
[16:47:48] <sukhe>	     file { '/etc/icinga/objects/puppet_hosts.cfg':
[16:47:49] <sukhe>	       content => generate('/usr/local/bin/naggen2', '--type', 'hosts'),
[16:47:52] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:48:08] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts flink-zk2002.codfw.wmnet
[16:48:21] <sukhe>	 maybe running naggen2 manually? just speculating 
[16:48:35] <cwhite>	 sukhe: the puppet certs for these hosts are still alive
[16:48:44] <sukhe>	 oh
[16:48:48] <icinga-wm>	 PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:48:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2003.codfw.wmnet
[16:48:52] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[16:49:34] <sukhe>	 https://phabricator.wikimedia.org/T344363#9098423
[16:49:37] <cwhite>	 probably need to cert destroy these hosts.  that should purge their entries from puppet
[16:49:43] <sukhe>	 Host steps raised exception: Cumin execution failed (exit_code=2)
[16:49:48] <icinga-wm>	 RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:50:00] <sukhe>	 seems like the cookbook failed, but not sure why
[16:50:19] <XioNoX>	 ok, now I see it https://puppetboard.wikimedia.org/node/ganeti3001.esams.wmnet
[16:50:37] <XioNoX>	 sukhe: I think the hosts were already unracked before decom
[16:50:46] <XioNoX>	 cwhite: thanks!
[16:51:52] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2003.codfw.wmnet - bking@cumin1001"
[16:52:12] <sukhe>	 we can do sudo puppet node deactivate ganeti3001.esams.wmnet
[16:52:23] <sukhe>	 cwhite: thanks indeed for the pointer!
[16:52:32] <XioNoX>	 sukhe: about to do `puppet node clean ganeti3001.esams.wmnet`
[16:52:33] <sukhe>	 that should remove them from puppetdb 
[16:52:37] <XioNoX>	 dunno the difference with deactivate
[16:52:48] <sukhe>	 yeah clean and then deactivate basically
[16:52:53] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2003.codfw.wmnet - bking@cumin1001"
[16:52:53] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:52:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2003.codfw.wmnet on all recursors
[16:52:57] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk2003.codfw.wmnet on all recursors
[16:53:01] <sukhe>	 worth a shot, should I try it?
[16:53:15] <XioNoX>	 I'm on it
[16:53:18] <sukhe>	 ok thanks
[16:53:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2003.codfw.wmnet - bking@cumin1001"
[16:53:28] <sukhe>	 https://wikitech.wikimedia.org/wiki/Puppet#PuppetDB suggests clean and deactivate
[16:53:32] <sukhe>	 in that order apparently
[16:53:52] <XioNoX>	 https://www.irccloud.com/pastebin/uDZu1Exm/
[16:54:00] <sukhe>	 ok
[16:54:02] <sukhe>	 trying a run now
[16:54:07] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2003.codfw.wmnet - bking@cumin1001"
[16:54:07] <sukhe>	 before you do the others
[16:54:13] <sukhe>	 should be a safe operation but yeah good to verify
[16:54:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2003.codfw.wmnet with OS bookworm
[16:54:22] <XioNoX>	 sukhe: too late :)
[16:54:24] <sukhe>	 ha
[16:54:28] <XioNoX>	 I did it on all of them
[16:54:30] <sukhe>	 all good :)
[16:54:33] <XioNoX>	 they're gone hosts anyway
[16:54:36] <sukhe>	 yeah
[16:54:44] <sukhe>	 XioNoX: let's move to -sre perhaps I guess
[16:54:50] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:59:05] <wikibugs>	 (03Abandoned) 10Ssingh: 10.in-addr.arpa: remove include for netbox/0.21.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949972 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[16:59:25] <wikibugs>	 (03CR) 10Ssingh: "Do not merge before Monday Aug 21" [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[17:00:06] <jouncebot>	 bd808: gettimeofday() says it's time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1700)
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1700)
[17:01:01] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:03:11] <bd808>	 I have updates to deploy for shellbox-syntaxhighlight today. Because the shellbox services share a helm chart and some code I will be redeploying all 5 shellbox* services.
[17:03:22] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] shellbox: Bump to 2023-08-15-040901 [deployment-charts] - 10https://gerrit.wikimedia.org/r/949548 (https://phabricator.wikimedia.org/T335460) (owner: 10BryanDavis)
[17:04:12] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox: Bump to 2023-08-15-040901 [deployment-charts] - 10https://gerrit.wikimedia.org/r/949548 (https://phabricator.wikimedia.org/T335460) (owner: 10BryanDavis)
[17:04:21] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[17:08:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:08:03] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply
[17:08:58] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[17:09:05] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[17:09:30] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[17:09:35] <icinga-wm>	 RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 461, down: 18, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:09:36] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[17:10:09] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[17:10:16] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:10:45] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:10:52] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[17:11:56] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[17:11:59] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:14:20] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[17:15:25] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[17:15:32] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[17:16:22] <wikibugs>	 (03PS5) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056)
[17:16:52] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[17:16:59] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[17:16:59] <wikibugs>	 (03PS6) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056)
[17:17:40] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[17:17:47] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:18:35] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:18:41] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[17:19:41] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan)
[17:19:45] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[17:20:25] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[17:21:27] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[17:21:33] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[17:22:16] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[17:22:22] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[17:23:11] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[17:23:17] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:23:55] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:24:02] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[17:24:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:25:00] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[17:32:29] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:15] <wikibugs>	 (PS1) Zabe: admin: New SSH key for zabe [puppet] - https://gerrit.wikimedia.org/r/949999
[17:37:54] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum1001.eqiad.wmnet with OS bookworm
[17:39:20] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum1002.eqiad.wmnet with OS bookworm
[17:39:43] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum2001.codfw.wmnet with OS bookworm
[17:39:51] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum2002.codfw.wmnet with OS bookworm
[17:41:13] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:41:30] <sukhe>	 ^ expected 
[17:41:51] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:41:53] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:42:07] <sukhe>	 ^ durum hosts that brett is reimaging, so all good
[17:42:13] * brett waves
[17:43:09] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-puppet-ca-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:43:21] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:44:01] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:44:09] <icinga-wm>	 PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:29] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:44:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bookworm
[17:45:10] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum4002.ulsfo.wmnet with OS bookworm
[17:45:23] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum5001.eqsin.wmnet with OS bookworm
[17:45:30] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum5002.eqsin.wmnet with OS bookworm
[17:45:39] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm
[17:45:40] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm
[17:45:59] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:17] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:46:19] <brett>	 This carnage is me mercilessly killing durum hosts
[17:46:31] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:46:39] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job bird in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:47:07] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage
[17:47:53] <brett>	 Sorry for the spam. But think of all the wormies that will be crawling around when this is finished
[17:49:28] <wikibugs>	 (PS1) Ssingh: site: reimage ncredir300[34] to proper role [puppet] - https://gerrit.wikimedia.org/r/950000 (https://phabricator.wikimedia.org/T344355)
[17:49:31] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:49:37] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:49:43] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:49:47] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:49:57] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:49:57] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:50:02] <brett>	 ^Ignore!
[17:50:15] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage
[17:51:39] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job bird in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:52:07] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage
[17:55:17] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage
[17:56:16] <wikibugs>	 (CR) Ssingh: [C: +2] site: reimage ncredir300[34] to proper role [puppet] - https://gerrit.wikimedia.org/r/950000 (https://phabricator.wikimedia.org/T344355) (owner: Ssingh)
[17:56:55] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage
[17:57:07] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage
[17:57:14] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3003.esams.wmnet with OS bullseye
[18:00:05] <jouncebot>	 brennen and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1800).
[18:00:11] <dancy>	 o/
[18:00:36] <wikibugs>	 (PS1) TrainBranchBot: group2 wikis to 1.41.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/950002 (https://phabricator.wikimedia.org/T343724)
[18:00:38] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group2 wikis to 1.41.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/950002 (https://phabricator.wikimedia.org/T343724) (owner: TrainBranchBot)
[18:01:29] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.41.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/950002 (https://phabricator.wikimedia.org/T343724) (owner: TrainBranchBot)
[18:02:06] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage
[18:04:43] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage
[18:07:23] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage
[18:07:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:07:55] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[18:07:57] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage
[18:08:59] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.22  refs T343724
[18:09:02] <stashbot>	 T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724
[18:09:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:10:34] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage
[18:12:32] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage
[18:12:33] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2003.codfw.wmnet with OS bookworm
[18:12:33] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2003.codfw.wmnet
[18:12:36] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[18:13:26] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1001.eqiad.wmnet with OS bookworm
[18:14:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:15:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage
[18:15:48] * Krinkle debugging on mw1439 (jobrunner)
[18:16:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:16:06] <icinga-wm>	 RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[18:17:00] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:17:42] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:17:45] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage
[18:18:32] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:21:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:21:05] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2002.codfw.wmnet with OS bookworm
[18:22:16] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:22:18] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:22:50] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:24:32] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:25:14] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2001.codfw.wmnet with OS bookworm
[18:26:47] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4002.ulsfo.wmnet with OS bookworm
[18:28:22] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (RobH)
[18:29:59] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage
[18:30:01] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage
[18:33:16] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage
[18:33:32] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:33:37] <sukhe>	 eh?
[18:33:42] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:35:26] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage
[18:35:38] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1002.eqiad.wmnet with OS bookworm
[18:37:33] <wikibugs>	 (PS1) Ssingh: hiera: update acme-chief authorized_hosts for ncredir300[34] [puppet] - https://gerrit.wikimedia.org/r/950005
[18:39:32] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:40:52] <wikibugs>	 (CR) BCornwall: [C: +1] hiera: update acme-chief authorized_hosts for ncredir300[34] [puppet] - https://gerrit.wikimedia.org/r/950005 (owner: Ssingh)
[18:41:00] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: update acme-chief authorized_hosts for ncredir300[34] [puppet] - https://gerrit.wikimedia.org/r/950005 (owner: Ssingh)
[18:41:26] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:41:42] <sukhe>	 !log force agent run on A:acmechief for CR 950005
[18:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:43:35] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[18:43:42] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:47:24] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:47:33] <sukhe>	 ^ resolving 
[18:47:46] <wikibugs>	 (CR) Eevans: [C: +2] restbase: set legacy ssl port & optional encryption to false [puppet] - https://gerrit.wikimedia.org/r/949587 (https://phabricator.wikimedia.org/T339298) (owner: Eevans)
[18:48:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:49:18] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:49:20] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6001.drmrs.wmnet with OS bookworm
[18:51:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:51:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3004.esams.wmnet with OS bullseye
[18:53:06] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:53:10] <urandom>	 !log Rolling Cassandra restart codfw/b — T339298
[18:53:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:13] <stashbot>	 T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[18:54:00] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3003.esams.wmnet with OS bullseye
[18:54:30] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm
[18:55:58] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:56:06] <wikibugs>	 (PS1) Ssingh: conf-tool/esams: add ncredir300[34] [puppet] - https://gerrit.wikimedia.org/r/950026 (https://phabricator.wikimedia.org/T344355)
[18:56:16] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.82:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[18:57:10] <urandom>	 that's me ^^^
[18:57:24] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:58:06] <urandom>	 I was so sure that would come under the threshold of an alert 🤨
[18:58:50] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.83:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[18:58:50] <wikibugs>	 (PS1) Gehel: query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361)
[18:59:18] <wikibugs>	 (CR) CI reject: [V: -1] query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[19:00:04] <urandom>	 oh! 🤦‍♂️
[19:00:42] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a SSL 10.192.16.82:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans SSL port has moved https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:00:42] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b SSL 10.192.16.83:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans SSL port has moved https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:00:42] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-c SSL 10.192.16.84:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans SSL port has moved https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:01:17] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum4001.ulsfo.wmnet with OS bookworm
[19:01:39] <wikibugs>	 (PS2) Gehel: query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361)
[19:02:30] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bookworm
[19:02:45] <wikibugs>	 (CR) Ryan Kemper: [C: +1] query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[19:03:57] <wikibugs>	 (CR) Gehel: [C: +2] query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[19:07:34] <wikibugs>	 (PS1) Eevans: restbase: Use port 7000 for ssl monitoring checks [puppet] - https://gerrit.wikimedia.org/r/950028 (https://phabricator.wikimedia.org/T339298)
[19:07:54] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage
[19:09:29] <wikibugs>	 (CR) Eevans: [C: +2] restbase: Use port 7000 for ssl monitoring checks [puppet] - https://gerrit.wikimedia.org/r/950028 (https://phabricator.wikimedia.org/T339298) (owner: Eevans)
[19:11:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage
[19:13:29] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:19:09] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:20:02] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5001.eqsin.wmnet with OS bookworm
[19:21:43] <wikibugs>	 (PS1) Gehel: Revert "query_service: fix glob expansion in blazegraph systemd unit" [puppet] - https://gerrit.wikimedia.org/r/950008
[19:22:09] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5002.eqsin.wmnet with OS bookworm
[19:22:25] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[19:22:41] <wikibugs>	 (CR) Ryan Kemper: [C: +1] Revert "query_service: fix glob expansion in blazegraph systemd unit" [puppet] - https://gerrit.wikimedia.org/r/950008 (owner: Gehel)
[19:22:48] <wikibugs>	 (CR) Gehel: [C: +2] Revert "query_service: fix glob expansion in blazegraph systemd unit" [puppet] - https://gerrit.wikimedia.org/r/950008 (owner: Gehel)
[19:23:18] <wikibugs>	 (PS1) Gehel: Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - https://gerrit.wikimedia.org/r/950009
[19:23:39] <wikibugs>	 (CR) Ryan Kemper: [C: +1] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - https://gerrit.wikimedia.org/r/950009 (owner: Gehel)
[19:24:01] <wikibugs>	 (CR) Gehel: [C: +2] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - https://gerrit.wikimedia.org/r/950009 (owner: Gehel)
[19:24:27] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:24:48] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[19:25:36] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[19:35:23] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3004.esams.wmnet with OS bullseye
[19:36:47] <wikibugs>	 (CR) Thcipriani: "The current default for Gerrit is 64. So 8 is probably OK 😊" [puppet] - https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: Jelto)
[19:40:10] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.234:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:51:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:42] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:54:48] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:56:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:05] <jouncebot>	 brennen and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T2000).
[20:00:06] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:03:55] <urandom>	 !log Rolling Cassandra restart codfw/c (RESTBase cluster) — T339298
[20:03:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:09] <stashbot>	 T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:06:00] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[20:08:02] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:32] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:11:12] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:12:16] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:12:22] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4001.ulsfo.wmnet with OS bookworm
[20:12:58] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[20:15:04] <jinxer-wm>	 (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:15:55] <wikibugs>	 (PS1) Thcipriani: Add newline to README for backport training [mediawiki-config] - https://gerrit.wikimedia.org/r/950036
[20:19:45] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverses for Lumen transport esams eqiad - cmooney@cumin1001"
[20:20:00] <urandom>	 !log Rolling Cassandra restart codfw/d (RESTBase cluster) — T339298
[20:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:05] <stashbot>	 T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:20:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/950036 (owner: Thcipriani)
[20:20:57] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverses for Lumen transport esams eqiad - cmooney@cumin1001"
[20:20:57] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:21:02] <wikibugs>	 (Merged) jenkins-bot: Add newline to README for backport training [mediawiki-config] - https://gerrit.wikimedia.org/r/950036 (owner: Thcipriani)
[20:21:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:21:16] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:950036|Add newline to README for backport training]]
[20:22:55] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani: Backport for [[gerrit:950036|Add newline to README for backport training]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:25:26] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:28:39] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani: Continuing with sync
[20:30:53] <urandom>	 !log Rolling Cassandra restart eqiad/a (RESTBase cluster) — T339298
[20:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:05] <stashbot>	 T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:34:46] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:950036|Add newline to README for backport training]] (duration: 13m 29s)
[20:35:23] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (ATsay-WMF)
[20:35:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:36:14] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:38:20] <wikibugs>	 (PS1) Thcipriani: Revert "Add newline to README for backport training" [mediawiki-config] - https://gerrit.wikimedia.org/r/950011
[20:38:33] <wikibugs>	 (CR) Ssingh: [C: +2] conf-tool/esams: add ncredir300[34] [puppet] - https://gerrit.wikimedia.org/r/950026 (https://phabricator.wikimedia.org/T344355) (owner: Ssingh)
[20:38:52] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:39:06] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/950011 (owner: Thcipriani)
[20:39:49] <wikibugs>	 (Merged) jenkins-bot: Revert "Add newline to README for backport training" [mediawiki-config] - https://gerrit.wikimedia.org/r/950011 (owner: Thcipriani)
[20:40:05] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:950011|Revert "Add newline to README for backport training"]]
[20:40:13] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1010.eqiad.wmnet with OS bullseye
[20:40:22] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1011.eqiad.wmnet with OS bullseye
[20:40:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:41:32] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani: Backport for [[gerrit:950011|Revert "Add newline to README for backport training"]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:41:38] <sukhe>	 !log restart pybal on lvs3010
[20:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:02] <thcipriani>	 sukhe: are we about to trigger that thing where scap depools everything if pybal is down?
[20:43:09] <sukhe>	 thcipriani: nope :)
[20:43:13] <sukhe>	 that has been resolved so should be fine
[20:43:18] <thcipriani>	 \o/
[20:43:33] <thcipriani>	 cool, alright, I'll continue my sync then if all's fine
[20:43:41] <sukhe>	 thcipriani: please do
[20:43:52] <thcipriani>	 <2
[20:43:54] <thcipriani>	 er
[20:43:56] <thcipriani>	 <3
[20:44:34] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani: Continuing with sync
[20:44:38] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1010.eqiad.wmnet with OS bullseye
[20:46:08] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[20:46:19] <sukhe>	 ^ should resolve
[20:46:24] <sukhe>	 soonish, in progress
[20:46:53] <sukhe>	 !log restart pybal on lvs3008
[20:46:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:59] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[20:49:16] <urandom>	 !log Rolling Cassandra restart eqiad/b (RESTBase cluster) — T339298
[20:49:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:19] <stashbot>	 T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:50:48] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:950011|Revert "Add newline to README for backport training"]] (duration: 10m 43s)
[20:51:24] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[20:51:50] <wikibugs>	 (PS7) Ssingh: knams migration: remove references to old esams [dns] - https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219)
[20:53:28] <wikibugs>	 (CR) Ssingh: "New change is the removal of the comment before ncredir, as we reimaged ncredir300[34], so will be pooling that as well. Previously we dec" [dns] - https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[20:54:03] <wikibugs>	 (CR) Ssingh: "Do not merge before Monday." [dns] - https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[20:55:26] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1011.eqiad.wmnet with reason: host reimage
[20:56:06] <urandom>	 !log Rolling Cassandra restart eqiad/d (RESTBase cluster) — T339298
[20:56:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:13] <stashbot>	 T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:56:50] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[20:58:00] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1011.eqiad.wmnet with reason: host reimage
[21:01:26] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:03:35] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:03:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:09:34] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:10:45] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:11:00] <wikibugs>	 ops-codfw, serviceops-radar, Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (Jhancock.wm) replaced the system board and the controller. System still did not post. pulled out everything except 1 ram, 1 cpu, a psu. Booted and started adding back components. Found...
[21:11:24] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@1d60a29]: make wikibase ttl imports to hdfs world readable
[21:11:36] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@1d60a29]: make wikibase ttl imports to hdfs world readable (duration: 00m 11s)
[21:16:21] <wikibugs>	 (PS1) Cathal Mooney: Correct IP for Arelion BGP peering esams. [homer/public] - https://gerrit.wikimedia.org/r/950042
[21:17:34] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Correct IP for Arelion BGP peering esams. [homer/public] - https://gerrit.wikimedia.org/r/950042 (owner: Cathal Mooney)
[21:18:05] <wikibugs>	 (Merged) jenkins-bot: Correct IP for Arelion BGP peering esams. [homer/public] - https://gerrit.wikimedia.org/r/950042 (owner: Cathal Mooney)
[21:21:54] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:21:57] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:22:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:24:28] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:25:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:25:14] <wikibugs>	 (CR) Jon Harald Søby: "Surely the Page and Index namespaces should be set in the ProofreadPage extension's  ProofreadPage.namespaces.php instead of WMF config, n" [mediawiki-config] - https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[21:27:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:29:08] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:15] <wikibugs>	 (CR) Jon Harald Søby: Some initial configurations for suwikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[21:34:01] <wikibugs>	 (CR) Jon Harald Søby: Some initial configurations for suwikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[21:34:04] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 5/5 UP : 4 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:34:36] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1011.eqiad.wmnet with OS bullseye
[21:34:43] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:40:38] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
[21:40:42] <stashbot>	 T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124
[21:40:54] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 16s)
[21:41:20] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:46:32] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:46:39] <wikibugs>	 (Abandoned) Cathal Mooney: Include reverse entries for new esams LVS IPv6 VIPs [dns] - https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) (owner: Cathal Mooney)
[21:51:50] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:54:11] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:57:14] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:59:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:54] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:03:22] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:47] <wikibugs>	 (PS1) Cathal Mooney: Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045
[22:05:46] <wikibugs>	 (CR) CI reject: [V: -1] Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045 (owner: Cathal Mooney)
[22:08:32] <wikibugs>	 (PS2) Cathal Mooney: Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045
[22:09:26] <wikibugs>	 (CR) CI reject: [V: -1] Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045 (owner: Cathal Mooney)
[22:16:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:47] <wikibugs>	 (PS1) DDesouza: Undeploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/950046 (https://phabricator.wikimedia.org/T336092)
[22:20:22] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:08] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[22:29:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:30:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (odimitrijevic) Approving group membership
[22:31:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[22:33:22] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[23:03:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:22] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:27:37] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)