[00:34:52] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[00:38:53] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/964634
[00:38:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/964634 (owner: TrainBranchBot)
[00:46:12] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:12] <wikibugs>	 (PS5) Cathal Mooney: Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583)
[00:53:40] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/964634 (owner: TrainBranchBot)
[00:54:58] <wikibugs>	 SRE, Math, RESTBase-API, Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (matmarex) For what it's worth, there's plenty of error logging indicating that Math has trouble contacting RESTBase: https://l...
[02:01:02] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1104
[02:02:23] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1104
[02:03:24] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1104.mgmt.eqiad.wmnet with reboot policy FORCED
[02:15:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T343198)', diff saved to https://phabricator.wikimedia.org/P52892 and previous config saved to /var/cache/conftool/dbconfig/20231011-021513-arnaudb.json
[02:15:18] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[02:18:30] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:51] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1104.mgmt.eqiad.wmnet with reboot policy FORCED
[02:27:46] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[02:27:54] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[02:30:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P52893 and previous config saved to /var/cache/conftool/dbconfig/20231011-023019-arnaudb.json
[02:38:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:44:32] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P52894 and previous config saved to /var/cache/conftool/dbconfig/20231011-024526-arnaudb.json
[02:49:26] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[02:49:34] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:00:33] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T343198)', diff saved to https://phabricator.wikimedia.org/P52895 and previous config saved to /var/cache/conftool/dbconfig/20231011-030032-arnaudb.json
[03:00:35] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[03:00:42] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[03:00:48] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[03:00:56] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T343198)', diff saved to https://phabricator.wikimedia.org/P52896 and previous config saved to /var/cache/conftool/dbconfig/20231011-030054-arnaudb.json
[03:03:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:31:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:33:32] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:36:26] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:36:54] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:36:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:00:41] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2023-10-11-045323-production [deployment-charts] - https://gerrit.wikimedia.org/r/964846 (https://phabricator.wikimedia.org/T341478)
[05:07:51] * kart_ deploying cxserver..
[05:08:10] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-10-11-045323-production [deployment-charts] - https://gerrit.wikimedia.org/r/964846 (https://phabricator.wikimedia.org/T341478) (owner: KartikMistry)
[05:09:02] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-10-11-045323-production [deployment-charts] - https://gerrit.wikimedia.org/r/964846 (https://phabricator.wikimedia.org/T341478) (owner: KartikMistry)
[05:10:41] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:11:03] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:18:55] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:19:29] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:21:04] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:21:33] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:23:54] <kart_>	 !log Updated cxserver to 2023-10-11-045323-production (T341478, T344982, T338432, T347939)
[05:24:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:03] <stashbot>	 T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939
[05:24:03] <stashbot>	 T344982: Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982
[05:24:03] <stashbot>	 T341478: Port the markup transfer feature of cxserver to MinT - https://phabricator.wikimedia.org/T341478
[05:24:04] <stashbot>	 T338432: Prepare the cxserver for usage without RESTbase - https://phabricator.wikimedia.org/T338432
[05:25:13] <kart_>	 Looks like cxserver is down. Checking.
[05:39:42] <wikibugs>	 (PS1) KartikMistry: cxserver: Fix restbase path [deployment-charts] - https://gerrit.wikimedia.org/r/965019
[05:40:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:40:44] <wikibugs>	 (PS1) KartikMistry: Revert "Update cxserver to 2023-10-11-045323-production" [deployment-charts] - https://gerrit.wikimedia.org/r/964603
[05:41:05] <wikibugs>	 (Abandoned) KartikMistry: cxserver: Fix restbase path [deployment-charts] - https://gerrit.wikimedia.org/r/965019 (owner: KartikMistry)
[05:42:41] <wikibugs>	 (CR) KartikMistry: [C: +2] Revert "Update cxserver to 2023-10-11-045323-production" [deployment-charts] - https://gerrit.wikimedia.org/r/964603 (owner: KartikMistry)
[05:43:24] <wikibugs>	 (Merged) jenkins-bot: Revert "Update cxserver to 2023-10-11-045323-production" [deployment-charts] - https://gerrit.wikimedia.org/r/964603 (owner: KartikMistry)
[05:44:16] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:44:32] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:45:04] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:45:38] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:45:55] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:46:17] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:50:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:51:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:55:52] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:58:37] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-10-11-045323-production [deployment-charts] - https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T341478)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T0600)
[06:11:23] <wikibugs>	 (PS1) Slyngshede: P:monitoring remove remnants of dpkg monitoring [puppet] - https://gerrit.wikimedia.org/r/965024 (https://phabricator.wikimedia.org/T332764)
[06:37:01] <wikibugs>	 (CR) Elukey: team-ml: add alert for Kafka consumer lag for ores extension (1 comment) [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[06:38:14] <wikibugs>	 (CR) Elukey: team-ml: add alert for memory spike in inf services (1 comment) [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[06:38:33] <wikibugs>	 (CR) Elukey: [C: +1] cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: Eevans)
[06:43:10] <wikibugs>	 (CR) Elukey: "Forgot also one thing - we can add test fixtures, see in other directories how it is done (basically you add a _test.yaml file etc..)." [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[06:43:16] <wikibugs>	 (CR) Elukey: "Forgot also one thing - we can add test fixtures, see in other directories how it is done (basically you add a _test.yaml file etc..)." [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[06:45:02] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: test kserve batcher for revertrisk-multilingual in staging [deployment-charts] - https://gerrit.wikimedia.org/r/964915 (https://phabricator.wikimedia.org/T348536) (owner: AikoChou)
[06:51:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:53:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:54] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:56:00] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:59:56] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T0700). Please do the needful.
[07:00:05] <jouncebot>	 sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:34] <sergi0>	 hi
[07:01:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:03:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:03:33] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:06:52] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:11:17] <hashar>	 jouncebot: now
[07:11:17] <jouncebot>	 For the next 0 hour(s) and 48 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T0700)
[07:11:39] <hashar>	 sergi0: good morning, I guess I will do the deployments :]
[07:11:59] <sergi0>	 hashar: I was about to start myself, as you wish :)
[07:13:05] <sergi0>	 *good morning :)
[07:13:24] <hashar>	 oh if you know how to deploy please go ahead!
[07:13:36] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/964523 (owner: EoghanGaffney)
[07:13:38] <wikibugs>	 (CR) Elukey: "Left a comment about build vs runtime OS. Another qs - have you tried to run docker-pkg locally to build the new image? To verify errors e" [docker-images/production-images] - https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: Kamila Součková)
[07:13:44] <sergi0>	 sure, starting
[07:14:01] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/964929 (https://phabricator.wikimedia.org/T308139) (owner: Sergio Gimeno)
[07:14:03] <hashar>	 I have added a patch to this window to unblock the mediawiki train  which I will run in ~ 45 minutes ( https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/964600/ )
[07:14:11] <hashar>	 and will deploy it once you are done ;)
[07:14:38] <sergi0>	 alright
[07:14:41] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: enable AddLink frontend 14th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/964929 (https://phabricator.wikimedia.org/T308139) (owner: Sergio Gimeno)
[07:15:43] <logmsgbot>	 !log sgimeno@deploy2002 Started scap: Backport for [[gerrit:964929|GrowthExperiments: enable AddLink frontend 14th round of wikis (T308139)]]
[07:15:48] <stashbot>	 T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139
[07:17:10] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:964929|GrowthExperiments: enable AddLink frontend 14th round of wikis (T308139)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:19:00] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Continuing with sync
[07:24:48] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap: Backport for [[gerrit:964929|GrowthExperiments: enable AddLink frontend 14th round of wikis (T308139)]] (duration: 09m 05s)
[07:24:52] <stashbot>	 T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139
[07:25:20] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/964949 (https://phabricator.wikimedia.org/T308141) (owner: Sergio Gimeno)
[07:25:42] <wikibugs>	 (PS2) Sergio Gimeno: GrowthExperiments: enable AddLink backend 15th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/964949 (https://phabricator.wikimedia.org/T308141)
[07:26:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/964949 (https://phabricator.wikimedia.org/T308141) (owner: Sergio Gimeno)
[07:26:54] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: enable AddLink backend 15th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/964949 (https://phabricator.wikimedia.org/T308141) (owner: Sergio Gimeno)
[07:27:16] <logmsgbot>	 !log sgimeno@deploy2002 Started scap: Backport for [[gerrit:964949|GrowthExperiments: enable AddLink backend 15th round of wikis (T308141)]]
[07:27:20] <stashbot>	 T308141: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141
[07:27:54] <wikibugs>	 (CR) Kevin Bazira: [C: +1] ml-services: add listener for mw-api in the rec-api-ng's config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475) (owner: Elukey)
[07:28:35] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:964949|GrowthExperiments: enable AddLink backend 15th round of wikis (T308141)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:29:19] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Continuing with sync
[07:29:29] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: add listener for mw-api in the rec-api-ng's config [deployment-charts] - https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475) (owner: Elukey)
[07:32:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[07:35:01] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap: Backport for [[gerrit:964949|GrowthExperiments: enable AddLink backend 15th round of wikis (T308141)]] (duration: 07m 45s)
[07:35:07] <stashbot>	 T308141: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141
[07:35:14] <wikibugs>	 (CR) Hashar: "I have commented on the task ( T340788#8991308 ) that the httpb tests should probably exercise the whole stack (ATS/Varnish caches > Envoy" [puppet] - https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) (owner: EoghanGaffney)
[07:35:33] <sergi0>	 hashar: finished my patches
[07:35:37] <hashar>	 great :)
[07:35:45] <wikibugs>	 (CR) Hashar: [C: +2] Move @font-size-base into mediawiki.skin.variables.less [skins/Vector] (wmf/1.41.0-wmf.30) - https://gerrit.wikimedia.org/r/964601 (https://phabricator.wikimedia.org/T348572) (owner: Jdlrobson)
[07:35:46] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:35:53] <wikibugs>	 (CR) Hashar: [C: +2] Fixes Echo skin style for user message bar [skins/Vector] (wmf/1.41.0-wmf.30) - https://gerrit.wikimedia.org/r/964600 (https://phabricator.wikimedia.org/T348530) (owner: Jdlrobson)
[07:36:02] <hashar>	 I am doing the backports for Vector
[07:36:14] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (MoritzMuehlenhoff)
[07:37:03] <wikibugs>	 (PS1) Muehlenhoff: Add aqu to the deployment group [puppet] - https://gerrit.wikimedia.org/r/965049 (https://phabricator.wikimedia.org/T347296)
[07:37:17] <wikibugs>	 (CR) CI reject: [V: -1] Add aqu to the deployment group [puppet] - https://gerrit.wikimedia.org/r/965049 (https://phabricator.wikimedia.org/T347296) (owner: Muehlenhoff)
[07:38:36] <wikibugs>	 (PS2) Muehlenhoff: Add aqu to the deployment group [puppet] - https://gerrit.wikimedia.org/r/965049 (https://phabricator.wikimedia.org/T347296)
[07:39:20] <wikibugs>	 (CR) DCausse: [C: +1] cirrus-streaming-updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/964928 (owner: Ebernhardson)
[07:39:39] <wikibugs>	 (CR) DCausse: admin: Add cirrus-streaming-updater namespace to flink operator [deployment-charts] - https://gerrit.wikimedia.org/r/964567 (https://phabricator.wikimedia.org/T347075) (owner: Ebernhardson)
[07:44:30] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add aqu to the deployment group [puppet] - https://gerrit.wikimedia.org/r/965049 (https://phabricator.wikimedia.org/T347296) (owner: Muehlenhoff)
[07:45:19] <wikibugs>	 SRE, MW-on-K8s, MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth, and 5 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (Clement_Goubert)
[07:50:36] <wikibugs>	 (Merged) jenkins-bot: Move @font-size-base into mediawiki.skin.variables.less [skins/Vector] (wmf/1.41.0-wmf.30) - https://gerrit.wikimedia.org/r/964601 (https://phabricator.wikimedia.org/T348572) (owner: Jdlrobson)
[07:50:44] <wikibugs>	 (Merged) jenkins-bot: Fixes Echo skin style for user message bar [skins/Vector] (wmf/1.41.0-wmf.30) - https://gerrit.wikimedia.org/r/964600 (https://phabricator.wikimedia.org/T348530) (owner: Jdlrobson)
[07:54:53] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 3 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @Antoine_Quhen I've enabled your access on th...
[07:57:28] <wikibugs>	 (CR) JMeybohm: "Sorry for volunteering you Hugh - I might be missing something here." [deployment-charts] - https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T341478) (owner: KartikMistry)
[08:00:02] <logmsgbot>	 !log hashar@deploy2002 Synchronized php-1.41.0-wmf.30/skins/Vector: Backports for Vector styling issues T348572 T348530 (duration: 06m 16s)
[08:00:05] <jouncebot>	 hashar and jeena: #bothumor I ♥ Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T0800).
[08:00:17] <stashbot>	 T348572: Wrong font size in OOUI dropdowns in Vector - https://phabricator.wikimedia.org/T348572
[08:00:18] <stashbot>	 T348530: Less_Exception_Compiler: variable @min-width-desktop-wide is undefined in file /srv/mediawiki/php-1.41.0-wmf.30/skins/Vector/skinStyles/ext.echo.styles.alert.less in ext.echo.styles.alert.less on line 22, column 2320 - https://phabricator.wikimedia.org/T348530
[08:08:40] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[08:08:44] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[08:15:50] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[08:15:54] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[08:16:19] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[08:30:38] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/965053 (https://phabricator.wikimedia.org/T347081)
[08:30:40] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/965053 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[08:31:23] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/965053 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[08:33:15] <hashar>	 andre and I are running the MediaWiki  train
[08:36:09] <wikibugs>	 (PS1) JMeybohm: admin_ng: Add namespace for wikifunctions mediawiki deployment [deployment-charts] - https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544)
[08:36:11] <wikibugs>	 (PS1) JMeybohm: Add mediawiki deployment for wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544)
[08:38:23] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.30  refs T347081
[08:38:27] <stashbot>	 T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081
[08:39:15] <wikibugs>	 (03PS1) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544)
[08:40:04] <wikibugs>	 (03PS1) 10AikoChou: ml-services: upgrade kserve to 0.11.1 for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/965057 (https://phabricator.wikimedia.org/T347550)
[08:42:20] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "One additional verification needed then lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh)
[08:44:24] <logmsgbot>	 !log hashar@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.30  refs T347081 (duration: 06m 00s)
[08:44:27] <wikibugs>	 (03CR) 10Muehlenhoff: idp: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff)
[08:44:27] <stashbot>	 T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081
[08:44:35] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff)
[08:45:23] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ml-services: upgrade kserve to 0.11.1 for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/965057 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou)
[08:45:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] P:monitoring remove remnants of dpkg monitoring [puppet] - 10https://gerrit.wikimedia.org/r/965024 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede)
[08:47:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff)
[08:47:44] <wikibugs>	 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Gehel) p:05Triage→03High
[08:49:07] <wikibugs>	 (03CR) 10AikoChou: [C: 03+2] ml-services: upgrade kserve to 0.11.1 for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/965057 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou)
[08:49:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] idp: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff)
[08:50:11] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: upgrade kserve to 0.11.1 for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/965057 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou)
[08:53:58] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:59:53] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583) (owner: 10Cathal Mooney)
[09:02:16] <wikibugs>	 (03CR) 10Ayounsi: "Note that there is a merge conflict with I2764b25d3fc32d9b2ee2ecc5e6115f5a08427fcb I can't rebase this one on top of it." [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[09:02:24] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[09:08:02] <wikibugs>	 (03CR) 10Clément Goubert: ml-services: add listener for mw-api in the rec-api-ng's config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475) (owner: 10Elukey)
[09:09:00] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "LGTM - I don't know about the inside changes of the image, but the iamge exists in the registry and I trust the fact that it has been test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/964848 (https://phabricator.wikimedia.org/T343511) (owner: 10Elukey)
[09:09:34] <wikibugs>	 (03CR) 10Vgutierrez: wikifunctions: Add routing to separate mw-on-k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert)
[09:10:56] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond)
[09:12:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond)
[09:12:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] late_command: update puppet installation logic [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond)
[09:15:02] <wikibugs>	 (03PS1) 10Joal: Bump mediawiki_history_snapshot to 2023-09 [puppet] - 10https://gerrit.wikimedia.org/r/965059
[09:15:35] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[09:19:04] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.dns.netbox
[09:19:09] <wikibugs>	 (03PS2) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544)
[09:19:24] <wikibugs>	 (03CR) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert)
[09:23:03] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add VIPs for mw-wikifunction - jayme@cumin1001"
[09:23:52] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add VIPs for mw-wikifunction - jayme@cumin1001"
[09:23:52] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:25:15] <wikibugs>	 (03CR) 10JMeybohm: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert)
[09:27:03] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) This happens again repeatedly with not so bi...
[09:29:01] <wikibugs>	 (03PS1) 10Jbond: late_command: add backwards compatible fallback: [puppet] - 10https://gerrit.wikimedia.org/r/965061
[09:29:51] <wikibugs>	 (03PS1) 10JMeybohm: Add mw-wikifunctions records [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544)
[09:29:55] <wikibugs>	 (03PS2) 10Jbond: late_command: add backwards compatible fallback: [puppet] - 10https://gerrit.wikimedia.org/r/965061
[09:31:15] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1001.eqiad.wmnet with OS bullseye
[09:32:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/965061 (owner: 10Jbond)
[09:32:50] <wikibugs>	 (03CR) 10Vgutierrez: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert)
[09:33:04] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/965061 (owner: 10Jbond)
[09:33:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] late_command: add backwards compatible fallback: [puppet] - 10https://gerrit.wikimedia.org/r/965061 (owner: 10Jbond)
[09:34:34] <wikibugs>	 10SRE-swift-storage, 10Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10MatthewVernon) I've gone looking, and the problem is that only one swift cluster has this object: ` root@ms-fe1009:/etc/swift# s...
[09:34:47] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[09:34:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10MoritzMuehlenhoff) >>! In T342537#9240999, @Papaul wrote: > looking at the gerrit history about the late command i see also that there where some changes m...
[09:37:34] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:43:38] <wikibugs>	 (03PS3) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544)
[09:44:17] <wikibugs>	 (03CR) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert)
[09:47:00] <wikibugs>	 (03PS1) 10Jbond: bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063
[09:47:54] <wikibugs>	 (03PS2) 10Jbond: bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063
[09:49:11] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[09:50:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 (owner: 10Jbond)
[09:52:28] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[09:52:34] <moritzm>	 !log rebuilding RAID after disk replacement T348429
[09:52:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:37] <stashbot>	 T348429: Broken disk on ganeti1022 - https://phabricator.wikimedia.org/T348429
[09:53:38] <wikibugs>	 (03PS3) 10Jbond: bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063
[09:54:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 (owner: 10Jbond)
[09:54:13] <wikibugs>	 (03PS1) 10JMeybohm: Add mw-wikifunctions discovery records [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544)
[09:54:30] <wikibugs>	 (03PS1) 10JMeybohm: service::catalog: Add mw-wikifunctions - 1 [puppet] - 10https://gerrit.wikimedia.org/r/965086 (https://phabricator.wikimedia.org/T347544)
[09:54:37] <wikibugs>	 (03CR) 10Volans: "unrelated comments, but make sense to add them here I think" [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[09:55:35] <wikibugs>	 (03PS4) 10Jbond: bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063
[09:57:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 6 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43984/console" [puppet] - 10https://gerrit.wikimedia.org/r/965063 (owner: 10Jbond)
[09:57:44] <wikibugs>	 (03CR) 10JMeybohm: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert)
[09:58:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 (owner: 10Jbond)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1000)
[10:02:28] <wikibugs>	 (03PS2) 10JMeybohm: Add mw-wikifunctions records [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544)
[10:02:30] <wikibugs>	 (03PS2) 10JMeybohm: Add mw-wikifunctions discovery records [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544)
[10:03:26] <wikibugs>	 (03CR) 10JMeybohm: Add mw-wikifunctions records (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[10:07:14] <wikibugs>	 (03PS4) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544)
[10:07:53] <wikibugs>	 (03CR) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert)
[10:08:14] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[10:09:02] <wikibugs>	 (03CR) 10Volans: Add mw-wikifunctions records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[10:10:28] <wikibugs>	 (03PS12) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910)
[10:10:30] <wikibugs>	 (03PS42) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910)
[10:11:56] <wikibugs>	 (03CR) 10Clément Goubert: admin_ng: Add namespace for wikifunctions mediawiki deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[10:13:47] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[10:14:30] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092
[10:14:53] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[10:15:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T343198)', diff saved to https://phabricator.wikimedia.org/P52897 and previous config saved to /var/cache/conftool/dbconfig/20231011-101545-arnaudb.json
[10:15:50] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[10:16:49] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43985/console" [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[10:17:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff)
[10:19:10] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965086 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[10:22:21] <wikibugs>	 (03PS1) 10Slyngshede: P:idm Provide callback to test system. [puppet] - 10https://gerrit.wikimedia.org/r/965093
[10:22:35] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) I've spent some time checking, and...
[10:22:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:idm Provide callback to test system. [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede)
[10:24:07] <wikibugs>	 (03PS2) 10Slyngshede: P:idm Provide callback to test system. [puppet] - 10https://gerrit.wikimedia.org/r/965093
[10:24:59] <wikibugs>	 (03CR) 10Hashar: "For Zuul in production:" [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: 10Hashar)
[10:25:34] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43986/console" [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede)
[10:27:03] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43987/console" [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede)
[10:28:35] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:monitoring remove remnants of dpkg monitoring [puppet] - 10https://gerrit.wikimedia.org/r/965024 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede)
[10:30:32] <wikibugs>	 (03CR) 10Muehlenhoff: P:idm Provide callback to test system. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede)
[10:30:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P52898 and previous config saved to /var/cache/conftool/dbconfig/20231011-103052-arnaudb.json
[10:30:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix mariadb restart behaviour on testreduce [puppet] - 10https://gerrit.wikimedia.org/r/965095
[10:31:14] <wikibugs>	 (03PS2) 10Muehlenhoff: Fix mariadb restart behaviour on testreduce [puppet] - 10https://gerrit.wikimedia.org/r/965095
[10:31:40] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy keytabs for apt1002/apt2002 [labs/private] - 10https://gerrit.wikimedia.org/r/964900 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff)
[10:33:13] <wikibugs>	 (03PS3) 10Slyngshede: P:idm Provide callback to test system. [puppet] - 10https://gerrit.wikimedia.org/r/965093
[10:34:17] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10fnegri) > new error popped up after rebooting > T348550  This seems to have resolved on its own? `/usr/local...
[10:34:50] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43988/console" [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede)
[10:36:14] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] P:idm Provide callback to test system. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede)
[10:36:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix mariadb restart behaviour on testreduce [puppet] - 10https://gerrit.wikimedia.org/r/965095 (owner: 10Muehlenhoff)
[10:38:19] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - 10https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583) (owner: 10Cathal Mooney)
[10:38:34] <wikibugs>	 (03CR) 10Hnowlan: [C: 04-1] "+1 on Janis's comments" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T341478) (owner: 10KartikMistry)
[10:38:53] <wikibugs>	 (03Merged) 10jenkins-bot: Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - 10https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583) (owner: 10Cathal Mooney)
[10:40:55] <wikibugs>	 (03PS2) 10JMeybohm: admin_ng: Add namespace for wikifunctions mediawiki deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544)
[10:40:57] <wikibugs>	 (03PS2) 10JMeybohm: Add mediawiki deployment for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544)
[10:40:59] <wikibugs>	 (03PS32) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[10:41:01] <wikibugs>	 (03PS13) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910)
[10:41:03] <wikibugs>	 (03PS43) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910)
[10:41:29] <wikibugs>	 (03CR) 10JMeybohm: admin_ng: Add namespace for wikifunctions mediawiki deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[10:41:32] <wikibugs>	 (03PS1) 10Jbond: late_command: need to update the target apt config [puppet] - 10https://gerrit.wikimedia.org/r/965096
[10:43:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede)
[10:43:36] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm Provide callback to test system. [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede)
[10:44:11] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Now I get `00411: FAILED: stashfailed: An un...
[10:44:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] late_command: need to update the target apt config [puppet] - 10https://gerrit.wikimedia.org/r/965096 (owner: 10Jbond)
[10:45:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P52899 and previous config saved to /var/cache/conftool/dbconfig/20231011-104558-arnaudb.json
[10:47:34] <wikibugs>	 (03PS3) 10Stevemunene: druid: Bring druid1011.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/962249 (https://phabricator.wikimedia.org/T336042)
[10:47:54] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) [I'm afraid my previous comment sti...
[10:48:04] <wikibugs>	 (03CR) 10Daniel Kinzler: Update cxserver to 2023-10-11-045323-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T341478) (owner: 10KartikMistry)
[10:48:52] <wikibugs>	 (03PS1) 10FNegri: wmcs::cloudlb: add cloud_production profile [puppet] - 10https://gerrit.wikimedia.org/r/965098
[10:49:16] <wikibugs>	 (03PS1) 10Samtar: InitialiseSettings-labs: Enable Edit Recovery on all beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965099
[10:50:14] <TheresNoTime>	 jouncebot: nowandnext
[10:50:14] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1000)
[10:50:14] <jouncebot>	 In 2 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1300)
[10:50:31] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965099 (owner: 10Samtar)
[10:51:14] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings-labs: Enable Edit Recovery on all beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965099 (owner: 10Samtar)
[10:52:31] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:52:41] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:53:48] <wikibugs>	 (03PS33) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[10:53:50] <wikibugs>	 (03PS14) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910)
[10:53:52] <wikibugs>	 (03PS44) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910)
[10:54:33] <wikibugs>	 (03PS1) 10Kosta Harlan: labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018)
[10:55:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Assign apt_repo role to apt1002 [puppet] - 10https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613)
[10:57:17] <wikibugs>	 (03PS4) 10Cathal Mooney: YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230)
[10:58:02] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[10:58:37] <wikibugs>	 (03Merged) 10jenkins-bot: YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[10:59:12] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8 DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43990/console" [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[10:59:45] <wikibugs>	 (03CR) 10Dreamy Jazz: [C: 03+1] labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) (owner: 10Kosta Harlan)
[11:01:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T343198)', diff saved to https://phabricator.wikimedia.org/P52900 and previous config saved to /var/cache/conftool/dbconfig/20231011-110105-arnaudb.json
[11:01:07] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[11:01:16] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[11:01:21] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[11:01:27] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52901 and previous config saved to /var/cache/conftool/dbconfig/20231011-110127-arnaudb.json
[11:03:33] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:08:04] <wikibugs>	 (03CR) 10Btullis: Support configuring the spark3 defaults with the default shuffler (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[11:10:47] <wikibugs>	 (03PS1) 10Hashar: zuul: move Gerrit key from merger to server [puppet] - 10https://gerrit.wikimedia.org/r/965103
[11:12:46] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:12:54] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] druid: Bring druid1011.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/962249 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene)
[11:12:56] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:13:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] zuul: move Gerrit key from merger to server [puppet] - 10https://gerrit.wikimedia.org/r/965103 (owner: 10Hashar)
[11:14:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (10cmooney) 05Open→03Resolved Changes pushed to production, closing task.
[11:18:27] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff)
[11:18:30] <wikibugs>	 (03CR) 10Jbond: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff)
[11:20:25] <wikibugs>	 (03PS34) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[11:20:27] <wikibugs>	 (03PS15) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910)
[11:20:29] <wikibugs>	 (03PS45) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910)
[11:21:34] <wikibugs>	 (03PS4) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312)
[11:22:38] <wikibugs>	 (03PS1) 10Jbond: late_command: set certificate_revocation = leaf in puppet [puppet] - 10https://gerrit.wikimedia.org/r/965104 (https://phabricator.wikimedia.org/T340543)
[11:23:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[11:23:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney)
[11:24:03] <wikibugs>	 (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[11:26:39] <wikibugs>	 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Volans) 05Resolved→03Open FYI The service IPs in Netbox are still allocated to the service and probably needs cleanup: https://netbox.wikimedia.org/ipam...
[11:27:09] <wikibugs>	 (03PS1) 10Slyngshede: SUL account linking, display success message. [software/bitu] - 10https://gerrit.wikimedia.org/r/965105
[11:27:56] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on druid1011.eqiad.wmnet with reason: Downtime as we setup the host to join the druid and zookeper cluster
[11:28:11] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on druid1011.eqiad.wmnet with reason: Downtime as we setup the host to join the druid and zookeper cluster
[11:29:34] <wikibugs>	 (PS1) Hashar: zuul: get ssh key from Puppet collected resource [puppet] - https://gerrit.wikimedia.org/r/965106
[11:31:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:31:33] <wikibugs>	 (PS1) Slyngshede: P:idm use service_fqdn to link correctly. [puppet] - https://gerrit.wikimedia.org/r/965107
[11:32:07] <wikibugs>	 (CR) CI reject: [V: -1] zuul: get ssh key from Puppet collected resource [puppet] - https://gerrit.wikimedia.org/r/965106 (owner: Hashar)
[11:36:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:39:02] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1029 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:47:08] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:51] * kart_ quickly updating cxserver (without RESTBase changes)
[11:52:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:53:30] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43998/console" [puppet] - https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[11:53:34] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] SUL account linking, display success message. [software/bitu] - https://gerrit.wikimedia.org/r/965105 (owner: Slyngshede)
[11:54:10] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[11:54:12] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/965107 (owner: Slyngshede)
[11:54:14] <wikibugs>	 (CR) CI reject: [V: -1] puppet: add support for puppetserver returning none 0 rc [software/spicerack] - https://gerrit.wikimedia.org/r/965112 (owner: Jbond)
[11:54:35] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[11:54:36] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] P:idm use service_fqdn to link correctly. [puppet] - https://gerrit.wikimedia.org/r/965107 (owner: Slyngshede)
[11:57:20] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[11:58:07] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[11:58:16] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM, q inline" [puppet] - https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[11:59:36] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[12:00:03] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[12:00:40] <kart_>	 !log Updated cxserver to 2023-10-11-114410-production (T341478, T347939)
[12:00:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:52] <stashbot>	 T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939
[12:00:53] <stashbot>	 T341478: Port the markup transfer feature of cxserver to MinT - https://phabricator.wikimedia.org/T341478
[12:03:10] <wikibugs>	 (PS1) Slyngshede: Add missing request parameter to message [software/bitu] - https://gerrit.wikimedia.org/r/965115
[12:03:47] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Add missing request parameter to message [software/bitu] - https://gerrit.wikimedia.org/r/965115 (owner: Slyngshede)
[12:06:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "CI failure is unrelated (I think bitly returning 401, works locally)" [alerts] - https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: Cathal Mooney)
[12:07:16] <wikibugs>	 (PS1) Clément Goubert: aux-k8s-ctrl: Fix missing PTR record [dns] - https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632)
[12:08:11] <wikibugs>	 (CR) Muehlenhoff: Assign apt_repo role to apt1002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[12:09:05] <wikibugs>	 (PS1) Volans: svc records: add missing comments for reserved IPs [dns] - https://gerrit.wikimedia.org/r/965119
[12:09:30] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1029 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:12:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[12:15:20] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ORES svc records - elukey@cumin1001"
[12:16:03] <wikibugs>	 (CR) Volans: [C: -1] "Typo in the PTR" [dns] - https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) (owner: Clément Goubert)
[12:16:04] <wikibugs>	 (CR) Hashar: ci: add Gerrit ssh key to ssh_known_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar)
[12:16:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ORES svc records - elukey@cumin1001"
[12:16:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:16:29] <wikibugs>	 (PS2) JMeybohm: service::catalog: Add mw-wikifunctions - 1 [puppet] - https://gerrit.wikimedia.org/r/965086 (https://phabricator.wikimedia.org/T347544)
[12:16:31] <wikibugs>	 (PS1) JMeybohm: Add mw-wikifunctions to mediawiki k8s releases [puppet] - https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544)
[12:16:37] <wikibugs>	 (PS1) Jbond: gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543)
[12:17:00] <wikibugs>	 (CR) Hashar: zuul: get ssh key from Puppet collected resource (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965106 (owner: Hashar)
[12:17:24] <wikibugs>	 (CR) CI reject: [V: -1] gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: Jbond)
[12:18:58] <wikibugs>	 (PS2) Clément Goubert: aux-k8s-ctrl: Fix missing PTR record [dns] - https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632)
[12:19:19] <wikibugs>	 (CR) Muehlenhoff: Fix wording (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/965120 (owner: Slyngshede)
[12:19:27] <wikibugs>	 (CR) Clément Goubert: aux-k8s-ctrl: Fix missing PTR record (1 comment) [dns] - https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) (owner: Clément Goubert)
[12:19:32] <wikibugs>	 (CR) Jbond: "see comments in line, my change is failing pcc so there are still some bits to fix up but it shouldn't be too difficult to fix up" [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar)
[12:20:56] <wikibugs>	 (PS2) Slyngshede: Update wording to read more clearly. [software/bitu] - https://gerrit.wikimedia.org/r/965120
[12:21:16] <wikibugs>	 (CR) Slyngshede: Update wording to read more clearly. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/965120 (owner: Slyngshede)
[12:21:50] <wikibugs>	 SRE, SRE-tools, DNS, Infrastructure-Foundations, serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (Volans) I really think that we need to find a solution for this. It has been pending for too long.  Today I did a check of the dns repository a...
[12:23:14] <wikibugs>	 (PS2) Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - https://gerrit.wikimedia.org/r/965092
[12:23:23] <wikibugs>	 (CR) Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (3 comments) [puppet] - https://gerrit.wikimedia.org/r/965092 (owner: Muehlenhoff)
[12:23:34] <wikibugs>	 (CR) Clément Goubert: [C: +1] Add mw-wikifunctions to mediawiki k8s releases [puppet] - https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[12:23:39] <wikibugs>	 (CR) CI reject: [V: -1] profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - https://gerrit.wikimedia.org/r/965092 (owner: Muehlenhoff)
[12:23:49] <wikibugs>	 (PS2) Jbond: gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543)
[12:24:14] <wikibugs>	 (CR) CI reject: [V: -1] gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: Jbond)
[12:24:26] <wikibugs>	 (PS1) Filippo Giunchedi: test: dump response body on runbook fetch failure [alerts] - https://gerrit.wikimedia.org/r/965123
[12:25:23] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/965120 (owner: Slyngshede)
[12:25:53] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Update wording to read more clearly. [software/bitu] - https://gerrit.wikimedia.org/r/965120 (owner: Slyngshede)
[12:27:16] <wikibugs>	 (PS1) Elukey: role::redis::misc::{master,slave}: remove ORES configs [puppet] - https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278)
[12:28:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Apparently no more 401, at least not right now, merging anyways" [alerts] - https://gerrit.wikimedia.org/r/965123 (owner: Filippo Giunchedi)
[12:28:21] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] test: dump response body on runbook fetch failure [alerts] - https://gerrit.wikimedia.org/r/965123 (owner: Filippo Giunchedi)
[12:30:04] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44001/console" [puppet] - https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[12:31:30] <wikibugs>	 (PS2) Elukey: role::redis::misc::{master,slave}: remove ORES configs [puppet] - https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278)
[12:32:16] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thx" [dns] - https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) (owner: Clément Goubert)
[12:33:14] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[12:34:34] <wikibugs>	 (PS2) Hashar: ci: add Gerrit ssh key to ssh_known_hosts [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543)
[12:34:52] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:34:53] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[12:37:12] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Cleanup decommissioned services apple-search and graphoid - cgoubert@cumin1001"
[12:38:03] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Cleanup decommissioned services apple-search and graphoid - cgoubert@cumin1001"
[12:38:03] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:38:12] <wikibugs>	 (CR) Clément Goubert: [C: +2] aux-k8s-ctrl: Fix missing PTR record [dns] - https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) (owner: Clément Goubert)
[12:39:56] <wikibugs>	 (CR) Klausman: [C: +1] role::redis::misc::{master,slave}: remove ORES configs [puppet] - https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[12:41:01] <wikibugs>	 (CR) Hashar: ci: add Gerrit ssh key to ssh_known_hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar)
[12:42:31] <wikibugs>	 SRE, Discovery-Search, collaboration-services, serviceops, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Clement_Goubert) Open→Resolved Done
[12:42:50] <wikibugs>	 (CR) JMeybohm: [C: +1] kube-state-metrics: create image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: Kamila Součková)
[12:43:22] <wikibugs>	 (PS3) Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - https://gerrit.wikimedia.org/r/965092
[12:43:36] <wikibugs>	 (PS1) AikoChou: ml-services: update revertrisk-la docker image [deployment-charts] - https://gerrit.wikimedia.org/r/965146 (https://phabricator.wikimedia.org/T347550)
[12:44:08] <wikibugs>	 SRE, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Clement_Goubert) Open→Resolved a:Clement_Goubert Done
[12:44:25] <wikibugs>	 (PS1) Clément Goubert: tegola-vector-tiles: Fix missing PTR [dns] - https://gerrit.wikimedia.org/r/965147 (https://phabricator.wikimedia.org/T348631)
[12:44:45] <wikibugs>	 (CR) JMeybohm: [C: +2] admin_ng: Add namespace for wikifunctions mediawiki deployment [deployment-charts] - https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[12:45:57] <wikibugs>	 (CR) CI reject: [V: -1] profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - https://gerrit.wikimedia.org/r/965092 (owner: Muehlenhoff)
[12:47:03] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Add namespace for wikifunctions mediawiki deployment [deployment-charts] - https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[12:48:35] <wikibugs>	 (PS3) JMeybohm: Add mw-wikifunctions records [dns] - https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544)
[12:51:21] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:51:55] <wikibugs>	 (PS3) Jbond: gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543)
[12:52:00] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:52:08] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[12:52:11] <wikibugs>	 (CR) Jbond: "FYI i updated the following to include this" [puppet] - https://gerrit.wikimedia.org/r/965103 (owner: Hashar)
[12:52:20] <wikibugs>	 (CR) CI reject: [V: -1] gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: Jbond)
[12:53:45] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[12:53:47] <wikibugs>	 (PS1) Cathal Mooney: Add puppet elements for newly added switches. [puppet] - https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230)
[12:53:55] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[12:53:58] <wikibugs>	 (CR) Jbond: "this approach is fine but would still leave a bit of duplication in the labs profile" [puppet] - https://gerrit.wikimedia.org/r/965106 (owner: Hashar)
[12:54:19] <wikibugs>	 (CR) CI reject: [V: -1] Add puppet elements for newly added switches. [puppet] - https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230) (owner: Cathal Mooney)
[12:55:14] <wikibugs>	 (CR) Jbond: ci: add Gerrit ssh key to ssh_known_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar)
[12:55:33] <wikibugs>	 (PS2) Cathal Mooney: Add puppet elements for newly added switches. [puppet] - https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230)
[12:55:39] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[12:56:05] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:56:46] <wikibugs>	 (PS1) Slyngshede: Minor styling updates [software/bitu] - https://gerrit.wikimedia.org/r/965150
[12:56:54] <wikibugs>	 (CR) Ssingh: [V: +1] hiera: announce ns1 IP from bird (codfw) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: Ssingh)
[12:56:56] <wikibugs>	 ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro)
[12:57:56] <wikibugs>	 (CR) Jbond: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (4 comments) [puppet] - https://gerrit.wikimedia.org/r/965092 (owner: Muehlenhoff)
[12:58:27] <wikibugs>	 (CR) Clément Goubert: Add mediawiki deployment for wikifunctions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[12:58:33] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Minor styling updates [software/bitu] - https://gerrit.wikimedia.org/r/965150 (owner: Slyngshede)
[12:58:52] <wikibugs>	 (PS4) Jbond: gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543)
[12:59:18] <wikibugs>	 (CR) CI reject: [V: -1] gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: Jbond)
[12:59:20] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet']
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1300).
[13:00:05] <jouncebot>	 TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:54] * TheresNoTime is going to remove that 
[13:02:06] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[13:02:28] <wikibugs>	 (PS5) Jbond: gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543)
[13:02:49] <wikibugs>	 (CR) JMeybohm: Add mediawiki deployment for wikifunctions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[13:03:25] <wikibugs>	 ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (aborrero)
[13:05:34] <wikibugs>	 (PS3) JMeybohm: Add mediawiki deployment for wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544)
[13:06:24] <wikibugs>	 (PS1) Elukey: api-gateway: add Content-type in the CORS' allowed headers settings [deployment-charts] - https://gerrit.wikimedia.org/r/965153 (https://phabricator.wikimedia.org/T348511)
[13:06:28] <wikibugs>	 (CR) JMeybohm: Add mediawiki deployment for wikifunctions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[13:06:30] <wikibugs>	 (CR) Clément Goubert: [C: +1] Add mediawiki deployment for wikifunctions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[13:07:28] <wikibugs>	 (PS4) Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - https://gerrit.wikimedia.org/r/965092
[13:07:40] <wikibugs>	 (CR) JMeybohm: [C: +2] Add mediawiki deployment for wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[13:07:47] <wikibugs>	 (CR) Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (2 comments) [puppet] - https://gerrit.wikimedia.org/r/965092 (owner: Muehlenhoff)
[13:08:36] <wikibugs>	 (Merged) jenkins-bot: Add mediawiki deployment for wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[13:09:45] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/965092 (owner: Muehlenhoff)
[13:10:14] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/965092 (owner: Muehlenhoff)
[13:11:50] <wikibugs>	 (CR) Ayounsi: [C: +1] hiera: announce ns1 IP from bird (codfw) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: Ssingh)
[13:12:20] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] "LGTM! Thanks for spotting this!" [deployment-charts] - https://gerrit.wikimedia.org/r/965153 (https://phabricator.wikimedia.org/T348511) (owner: Elukey)
[13:13:19] <wikibugs>	 (PS1) Majavah: team-wmcs: ceph: cleanup summaries of existing alerts [alerts] - https://gerrit.wikimedia.org/r/965154
[13:13:21] <wikibugs>	 (PS1) Majavah: team-wmcs: ceph: add alert for slow ops [alerts] - https://gerrit.wikimedia.org/r/965155
[13:13:46] <wikibugs>	 (CR) Ayounsi: [C: +1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230) (owner: Cathal Mooney)
[13:14:41] <wikibugs>	 (CR) Ayounsi: [C: +1] "Might be worth running PCC on alert1001 and install1004 just in case." [puppet] - https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230) (owner: Cathal Mooney)
[13:14:43] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[13:14:44] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[13:15:55] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003.eqiad.wmnet']
[13:16:07] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet']
[13:16:17] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003.eqiad.wmnet']
[13:16:33] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet']
[13:18:26] <wikibugs>	 (CR) Klausman: [C: +1] api-gateway: add Content-type in the CORS' allowed headers settings [deployment-charts] - https://gerrit.wikimedia.org/r/965153 (https://phabricator.wikimedia.org/T348511) (owner: Elukey)
[13:18:43] <wikibugs>	 (PS2) JMeybohm: Add mw-wikifunctions to mediawiki k8s releases [puppet] - https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544)
[13:18:45] <wikibugs>	 (PS3) JMeybohm: service::catalog: Add mw-wikifunctions - 1 [puppet] - https://gerrit.wikimedia.org/r/965086 (https://phabricator.wikimedia.org/T347544)
[13:18:47] <wikibugs>	 (PS1) JMeybohm: deployment_server: Add mw-wikifunctions [puppet] - https://gerrit.wikimedia.org/r/965156 (https://phabricator.wikimedia.org/T347544)
[13:20:00] <wikibugs>	 (CR) JMeybohm: [C: +2] deployment_server: Add mw-wikifunctions [puppet] - https://gerrit.wikimedia.org/r/965156 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[13:23:32] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[13:24:03] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye
[13:24:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[13:24:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) @MoritzMuehlenhoff thanks
[13:24:58] <urandom>	 !log starting decommission of restbase2012-a — T328490
[13:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:01] <stashbot>	 T328490: restbase cluster: decommission end-of-life hosts - https://phabricator.wikimedia.org/T328490
[13:25:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2497
[13:25:56] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2497
[13:26:02] <wikibugs>	 (PS1) Jelto: gitlab_runner: block dockerhub on Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/965157 (https://phabricator.wikimedia.org/T320730)
[13:26:21] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6368
[13:26:35] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6368
[13:26:51] <Kemayo>	 TheresNoTime: is the backport window still sufficiently open that I could sneak something in, or should I wait for the next one?
[13:27:11] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9031
[13:27:19] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9031
[13:27:27] <TheresNoTime>	 Kemayo: go ahead if you can deploy :)
[13:27:49] <Kemayo>	 I cannot deploy, unfortunately.
[13:27:58] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sretest1003.eqiad.wmnet']
[13:28:12] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] P:IDM Enable logging of remote IPs. [puppet] - https://gerrit.wikimedia.org/r/963258 (owner: Slyngshede)
[13:28:38] <sukhe>	 !log disable puppet on P:bird::anycast
[13:28:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:42] <sukhe>	 !log disable puppet on P:bird::anycast: T348041
[13:28:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:46] <stashbot>	 T348041: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041
[13:28:48] <wikibugs>	 (PS4) Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151)
[13:29:20] <TheresNoTime>	 Kemayo: I'm away from my laptop, what did you want to get deployed?
[13:30:01] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44004/console" [puppet] - https://gerrit.wikimedia.org/r/965157 (https://phabricator.wikimedia.org/T320730) (owner: Jelto)
[13:30:04] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "The SSH key management module works fine and is ready to go live (also tested email changes and implicitly the new theming), let's update " [software/bitu] - https://gerrit.wikimedia.org/r/959211 (owner: Slyngshede)
[13:30:11] <wikibugs>	 (PS1) JMeybohm: Remove namespace quota and limitranger from mw-wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/965158 (https://phabricator.wikimedia.org/T347544)
[13:30:19] <Kemayo>	 TheresNoTime: I had a config change and a backport of a change to VE. If it can't happen right now, that's fine, I can make it to the late window.
[13:30:36] <TheresNoTime>	 Probably best, sorry! :)
[13:30:44] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38195
[13:30:55] <Kemayo>	 TheresNoTime: 👍🏻
[13:31:22] <Lucas_WMDE>	 I’m around if needed
[13:31:28] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38195
[13:31:49] <Lucas_WMDE>	 jouncebot: next
[13:31:49] <jouncebot>	 In 0 hour(s) and 28 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1400)
[13:31:49] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] hiera: announce ns1 IP from bird (codfw) [puppet] - https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: Ssingh)
[13:31:51] <Lucas_WMDE>	 hm
[13:32:08] <Lucas_WMDE>	 Kemayo: how closely related are the config change and backport? I’m not sure there’s time for both
[13:32:37] <Lucas_WMDE>	 but I could probably deploy one at least, if that’s useful
[13:32:51] <wikibugs>	 (CR) JMeybohm: [C: +2] Remove namespace quota and limitranger from mw-wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/965158 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[13:34:09] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[13:34:46] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 40317
[13:35:23] <wikibugs>	 (Merged) jenkins-bot: Remove namespace quota and limitranger from mw-wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/965158 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[13:35:47] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40317
[13:36:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38628
[13:36:12] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[13:36:29] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38628
[13:37:01] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 150552
[13:37:23] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 150552
[13:37:32] <Kemayo>	 Lucas_WMDE: sadly they both need to go in. The backport could go without the config because it won’t actually be active without it, I guess…
[13:37:34] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[13:37:53] <Lucas_WMDE>	 eh, I could still do the backport then
[13:37:57] <Lucas_WMDE>	 so the late window goes faster ^^
[13:37:58] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[13:38:00] <Lucas_WMDE>	 wdyt?
[13:38:08] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (MatthewVernon) Resolved→Open
[13:38:29] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (MatthewVernon) @Papaul can I loop you in here, please? You've previously managed to successfully configure hardware like this as JBOD, but it seems to...
[13:38:30] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[13:38:32] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[13:39:10] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (MatthewVernon) Resolved→Open [re-opening this as the JBOD issue still needs resolving, similar to T342674]
[13:39:43] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[13:39:44] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[13:40:17] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[13:40:45] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[13:41:04] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, ops-eqiad, DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (MatthewVernon) a:MatthewVernon→None
[13:41:34] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (MatthewVernon)
[13:41:43] <wikibugs>	 (PS1) Muehlenhoff: Failover testreduce to testreduce1002 [dns] - https://gerrit.wikimedia.org/r/965163 (https://phabricator.wikimedia.org/T345220)
[13:42:04] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thx" [dns] - https://gerrit.wikimedia.org/r/965147 (https://phabricator.wikimedia.org/T348631) (owner: Clément Goubert)
[13:42:15] <wikibugs>	 (PS2) Muehlenhoff: Failover testreduce to testreduce1002 [dns] - https://gerrit.wikimedia.org/r/965163 (https://phabricator.wikimedia.org/T345220)
[13:42:40] <Kemayo>	 Lucas_WMDE: sure, that works! They’re in the late backport window on Deployments now, if you want to grab the commands.
[13:42:43] <elukey>	 !log restart kube-apiserver on ml-serve-ctrl1001 as attempt to clear a weird golang/protobuf issue while retrieving secrets
[13:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:46] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (MatthewVernon) moss-be1003 is now in place (cf T342675) so could the NVME card be installed please @Jclark-ctr ?
[13:43:03] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (MatthewVernon) a:MatthewVernon→None
[13:43:13] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:43:18] <wikibugs>	 (PS1) Slyngshede: P:idm improve apache2 logging. [puppet] - https://gerrit.wikimedia.org/r/965165
[13:43:45] <Lucas_WMDE>	 Kemayo: okay! can the backport alone be tested?
[13:43:46] <wikibugs>	 SRE, Abstract Wikipedia team, Traffic, Wikifunctions, and 2 others: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (Jdforrester-WMF)
[13:44:03] <Kemayo>	 Lucas_WMDE: this one, to be specific: https://gerrit.wikimedia.org/r/c/963042/
[13:44:10] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (MatthewVernon) moss-be2003 is now on site (cf T342674) so could this NVME card now be installed, please, @Jhancock.wm ?
[13:44:12] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Enable SSH key management for all users. [software/bitu] - https://gerrit.wikimedia.org/r/959211 (owner: Slyngshede)
[13:44:18] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/963042 (owner: DLynch)
[13:44:39] <Lucas_WMDE>	 (just wondering whether I should wait for your confirmation when it’s on the test servers, or sync it directly)
[13:45:06] <Lucas_WMDE>	 also I’m guessing the wmf.28 backport is obsolete now :)
[13:45:25] <elukey>	 !log restart kube-apiserver on ml-serve-ctrl1002
[13:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:54] <Kemayo>	 Yeah, this was all originally
[13:46:04] <Kemayo>	 Going to be deployed last week, and so…
[13:46:32] <Lucas_WMDE>	 ok, I see
[13:46:46] <Kemayo>	 But yes. Sync it directly — there’s no way for me to actually test it without the config patch also being out.
[13:46:55] <Lucas_WMDE>	 ack, thanks!
[13:46:59] <Lucas_WMDE>	 I’ll do that then
[13:47:04] <Lucas_WMDE>	 and good luck tonight ^^
[13:47:35] <Kemayo>	 Thanks for the help!
[13:48:31] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (fnegri) If it can be useful, I generated a summary of `Offline_Uncorrectable` sectors per host: https://phabricator.wikimedia.org/P52907
[13:49:12] <wikibugs>	 (Abandoned) Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/964506 (owner: L10n-bot)
[13:49:14] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44005/console" [puppet] - https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[13:50:52] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[13:53:05] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (Jhancock.wm) @MatthewVernon the card is installed.
[13:54:49] <wikibugs>	 (PS1) Ayounsi: set anycast4 orlonger instead of longer [homer/public] - https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041)
[13:55:16] <wikibugs>	 SRE, ops-eqiad: Broken disk on ganeti1022 - https://phabricator.wikimedia.org/T348429 (Jclark-ctr) Open→Resolved
[13:55:56] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Assign apt_repo role to apt1002 [puppet] - https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[13:56:15] <wikibugs>	 (CR) Ssingh: [C: +1] "FWIW :)" [homer/public] - https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041) (owner: Ayounsi)
[13:56:18] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041) (owner: Ayounsi)
[13:56:30] <wikibugs>	 (CR) Ayounsi: [C: +2] set anycast4 orlonger instead of longer [homer/public] - https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041) (owner: Ayounsi)
[13:56:53] <wikibugs>	 SRE, observability, SRE Observability (FY2023/2024-Q2): Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (herron) Open→Resolved a:herron Done!
[13:57:07] <wikibugs>	 (Merged) jenkins-bot: set anycast4 orlonger instead of longer [homer/public] - https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041) (owner: Ayounsi)
[13:58:17] <wikibugs>	 (Merged) jenkins-bot: Edit check: Simplify "experience" config to "maximumEditcount" [extensions/VisualEditor] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/963042 (owner: DLynch)
[13:58:33] <logmsgbot>	 !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye
[13:58:51] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye
[13:58:57] <wikibugs>	 SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (MatthewVernon)
[13:58:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963042|Edit check: Simplify "experience" config to "maximumEditcount"]]
[13:59:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[13:59:11] * Lucas_WMDE acks TheresNoTime’s beta-only change on deploy2002
[13:59:12] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (MatthewVernon) Open→Resolved a:MatthewVernon Oh, yes, so it is, sorry.
[14:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1400)
[14:00:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kemayo: Backport for [[gerrit:963042|Edit check: Simplify "experience" config to "maximumEditcount"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:00:22] <Lucas_WMDE>	 I’m still deploying, sorry
[14:00:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kemayo: Continuing with sync
[14:00:36] <Lucas_WMDE>	 maybe ~4 more minutes or so
[14:01:25] <Kemayo>	 No worries from me
[14:01:51] <Lucas_WMDE>	 Kemayo: that was mainly directed at the people doing the Wikifunction Services window
[14:02:08] <Lucas_WMDE>	 (if the window had any IRC nicks in it I could ping them to let them know they’re not yet free to go…)
[14:02:23] <Kemayo>	 🤔
[14:03:17] <wikibugs>	 ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T348550 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:05:29] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED
[14:06:02] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: update revertrisk-la docker image [deployment-charts] - https://gerrit.wikimedia.org/r/965146 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou)
[14:06:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963042|Edit check: Simplify "experience" config to "maximumEditcount"]] (duration: 07m 13s)
[14:06:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:06:17] * Lucas_WMDE done
[14:06:24] <Lucas_WMDE>	 if anyone wants to deploy wikifunctions services now :)
[14:06:35] <wikibugs>	 (CR) JMeybohm: Pull some flink config down into the chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: Ebernhardson)
[14:07:14] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED
[14:07:20] <jayme>	 I would like to do something different that would interfere with mw deployments. So I'll gratefully take the heads-up :)
[14:09:00] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] Add mw-wikifunctions to mediawiki k8s releases [puppet] - https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[14:09:48] <wikibugs>	 (CR) Kamila Součková: "Thanks for the review!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: Kamila Součková)
[14:10:28] <wikibugs>	 (CR) Elukey: [C: +1] kube-state-metrics: create image [docker-images/production-images] - https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: Kamila Součková)
[14:10:49] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) I opened up a ticket with dell for 1 server right now   Confirmed: Service Request 177592506 was successfully submitted.
[14:13:07] <logmsgbot>	 !log jayme@deploy2002 Started scap: (no justification provided)
[14:15:22] <logmsgbot>	 !log jayme@deploy2002 Finished scap: (no justification provided) (duration: 02m 15s)
[14:16:49] <wikibugs>	 (PS1) Muehlenhoff: Extend acmechief config for new apt hosts [puppet] - https://gerrit.wikimedia.org/r/965170 (https://phabricator.wikimedia.org/T331613)
[14:17:01] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED
[14:18:05] <wikibugs>	 (CR) JMeybohm: [C: +2] Add mw-wikifunctions records [dns] - https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[14:18:15] <moritzm>	 !log installing curl security updates on bullseye/bookworm
[14:18:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:05] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED
[14:21:07] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[14:21:13] <wikibugs>	 (Abandoned) DLynch: Edit check: Simplify "experience" config to "maximumEditcount" [extensions/VisualEditor] (wmf/1.41.0-wmf.28) - https://gerrit.wikimedia.org/r/963041 (owner: DLynch)
[14:21:17] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[14:22:37] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1101']
[14:23:05] <wikibugs>	 SRE, Abstract Wikipedia team, Traffic, Wikifunctions, and 2 others: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (JMeybohm)
[14:23:48] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[14:24:07] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[14:24:31] <wikibugs>	 (PS2) Clément Goubert: tegola-vector-tiles: Fix missing PTR [dns] - https://gerrit.wikimedia.org/r/965147 (https://phabricator.wikimedia.org/T348631)
[14:25:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[14:25:16] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[14:25:17] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[14:25:30] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[14:26:11] <wikibugs>	 (PS1) Hnowlan: rest-gateway: correct paths for edit, editor and page-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/965173 (https://phabricator.wikimedia.org/T347027)
[14:27:43] <wikibugs>	 (PS1) Muehlenhoff: Move restbase canary [puppet] - https://gerrit.wikimedia.org/r/965174 (https://phabricator.wikimedia.org/T328490)
[14:27:55] <wikibugs>	 (CR) Kamila Součková: [V: +2 C: +2] kube-state-metrics: create image [docker-images/production-images] - https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: Kamila Součková)
[14:28:13] <wikibugs>	 (CR) Clément Goubert: [C: +2] tegola-vector-tiles: Fix missing PTR [dns] - https://gerrit.wikimedia.org/r/965147 (https://phabricator.wikimedia.org/T348631) (owner: Clément Goubert)
[14:28:45] <claime>	 !log Running authdns-update - T348631
[14:28:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:49] <stashbot>	 T348631: tegola-vector-tiles SVC records missing reverse PTRs - https://phabricator.wikimedia.org/T348631
[14:29:56] <wikibugs>	 (CR) Hnowlan: [C: +2] rest-gateway: correct paths for edit, editor and page-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/965173 (https://phabricator.wikimedia.org/T347027) (owner: Hnowlan)
[14:30:27] <icinga-wm>	 PROBLEM - Check systemd state on apt1002 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:49] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: correct paths for edit, editor and page-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/965173 (https://phabricator.wikimedia.org/T347027) (owner: Hnowlan)
[14:31:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:31:53] <wikibugs>	 (CR) JMeybohm: [C: +2] service::catalog: Add mw-wikifunctions - 1 [puppet] - https://gerrit.wikimedia.org/r/965086 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[14:33:45] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:16] <wikibugs>	 (CR) AikoChou: [C: +2] ml-services: update revertrisk-la docker image [deployment-charts] - https://gerrit.wikimedia.org/r/965146 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou)
[14:35:09] <wikibugs>	 (Merged) jenkins-bot: ml-services: update revertrisk-la docker image [deployment-charts] - https://gerrit.wikimedia.org/r/965146 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou)
[14:36:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:37:05] <wikibugs>	 (PS1) JMeybohm: service::catalog: Add mw-wikifunctions - 2 [puppet] - https://gerrit.wikimedia.org/r/965175 (https://phabricator.wikimedia.org/T347544)
[14:37:07] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (Jclark-ctr) @MatthewVernon  Installed last nvme card into moss-be1003
[14:37:09] <wikibugs>	 (PS1) JMeybohm: service::catalog: Add mw-wikifunctions - 3 [puppet] - https://gerrit.wikimedia.org/r/965176 (https://phabricator.wikimedia.org/T347544)
[14:37:11] <wikibugs>	 (PS1) JMeybohm: service::catalog: Add mw-wikifunctions - 4 [puppet] - https://gerrit.wikimedia.org/r/965177 (https://phabricator.wikimedia.org/T347544)
[14:37:24] <wikibugs>	 SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (Jclark-ctr)
[14:37:49] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[14:38:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:17] <wikibugs>	 (PS3) Volans: svc records: add missing comments for reserved IPs [dns] - https://gerrit.wikimedia.org/r/965119
[14:39:34] <wikibugs>	 (CR) Vgutierrez: [C: +1] service::catalog: Add mw-wikifunctions - 2 [puppet] - https://gerrit.wikimedia.org/r/965175 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[14:40:03] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:49] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[14:42:59] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (Clement_Goubert) Open→In progress
[14:43:25] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:18] <wikibugs>	 (PS1) Elukey: profile::prometheus::k8s: drop unused labels for k8s-pods-kserve [puppet] - https://gerrit.wikimedia.org/r/965178 (https://phabricator.wikimedia.org/T348456)
[14:45:26] <jayme>	 !log disabling puppet on 'P{O:lvs::balancer} and (A:codfw or A:eqiad)'
[14:45:27] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[14:45:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:43] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[14:46:13] <wikibugs>	 (CR) JMeybohm: [C: +2] service::catalog: Add mw-wikifunctions - 2 [puppet] - https://gerrit.wikimedia.org/r/965175 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[14:47:54] <wikibugs>	 (PS5) Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - https://gerrit.wikimedia.org/r/965092
[14:48:03] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, Data-Platform-SRE, and 2 others: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (Jclark-ctr) @taavi  What vlan are these going to be   I would like to verify with @cmooney  that these can go into these racks before i physically move them.
[14:48:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:49:29] <jayme>	 !log running puppet on 'O:lvs::balancer'
[14:49:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:39] <wikibugs>	 (CR) Eevans: [C: +1] Move restbase canary [puppet] - https://gerrit.wikimedia.org/r/965174 (https://phabricator.wikimedia.org/T328490) (owner: Muehlenhoff)
[14:50:43] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] gitlab_runner: block dockerhub on Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/965157 (https://phabricator.wikimedia.org/T320730) (owner: Jelto)
[14:52:07] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:52:10] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44006/console" [puppet] - https://gerrit.wikimedia.org/r/965178 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[14:52:15] <jayme>	 !log restarting pybal on lvs1020 and lvs2014
[14:52:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:52] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Move restbase canary [puppet] - https://gerrit.wikimedia.org/r/965174 (https://phabricator.wikimedia.org/T328490) (owner: Muehlenhoff)
[14:54:13] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/965092 (owner: Muehlenhoff)
[14:54:39] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:54:53] <jayme>	 this is me
[14:55:19] <jayme>	 !log restarting pybal on lvs1019 and lvs2013
[14:55:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:27] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 78 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[14:55:51] <vgutierrez>	 that's jayme :)
[14:56:28] <jayme>	 that's right :)
[14:57:29] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:57:46] <jayme>	 that's kind of me as well
[14:59:20] <wikibugs>	 (PS1) Ilias Sarantopoulos: admin_ng/ml-serve: add namespace permissions for llm [puppet] - https://gerrit.wikimedia.org/r/965180 (https://phabricator.wikimedia.org/T348661)
[14:59:52] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:59:56] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, Data-Platform-SRE, and 2 others: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) Thanks @Jclark-ctr yes these can go in E4 or F4 no problem.
[15:00:00] <icinga-wm>	 PROBLEM - HTTP on apt1002 is CRITICAL: connect to address 208.80.154.10 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/APT_repository
[15:00:20] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:26] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[15:00:28] <wikibugs>	 (CR) Fabfur: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/964946 (https://phabricator.wikimedia.org/T336391) (owner: Hnowlan)
[15:01:33] <wikibugs>	 (CR) JMeybohm: [C: +2] service::catalog: Add mw-wikifunctions - 3 [puppet] - https://gerrit.wikimedia.org/r/965176 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[15:01:42] <icinga-wm>	 PROBLEM - HTTPS on apt1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/APT_repository
[15:03:03] <wikibugs>	 (PS1) Ilias Sarantopoulos: admin_ng: add llm namespace and config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/965181 (https://phabricator.wikimedia.org/T348661)
[15:04:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on apt1002.wikimedia.org with reason: setup in progress
[15:05:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on apt1002.wikimedia.org with reason: setup in progress
[15:05:56] <wikibugs>	 (CR) Hnowlan: [C: +2] trafficserver: route pageviews to page-analytics [puppet] - https://gerrit.wikimedia.org/r/964946 (https://phabricator.wikimedia.org/T336391) (owner: Hnowlan)
[15:07:21] <wikibugs>	 (PS1) Muehlenhoff: Always restart parsoid-rt/parsoid-rt-client on failures [puppet] - https://gerrit.wikimedia.org/r/965183
[15:09:49] <wikibugs>	 (CR) Klausman: [C: +1] admin_ng/ml-serve: add namespace permissions for llm [puppet] - https://gerrit.wikimedia.org/r/965180 (https://phabricator.wikimedia.org/T348661) (owner: Ilias Sarantopoulos)
[15:09:54] <wikibugs>	 (CR) Klausman: [C: +1] admin_ng: add llm namespace and config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/965181 (https://phabricator.wikimedia.org/T348661) (owner: Ilias Sarantopoulos)
[15:10:29] <wikibugs>	 (CR) Klausman: [C: +1] profile::prometheus::k8s: drop unused labels for k8s-pods-kserve [puppet] - https://gerrit.wikimedia.org/r/965178 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[15:12:31] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] aborrero: drop access [labs/private] - https://gerrit.wikimedia.org/r/964926 (owner: Arturo Borrero Gonzalez)
[15:12:33] <wikibugs>	 (PS1) Hnowlan: trafficserver: correct pageviews paths [puppet] - https://gerrit.wikimedia.org/r/965184 (https://phabricator.wikimedia.org/T336391)
[15:12:42] <wikibugs>	 (CR) JMeybohm: [C: +2] service::catalog: Add mw-wikifunctions - 4 [puppet] - https://gerrit.wikimedia.org/r/965177 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[15:14:14] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:15:39] <wikibugs>	 (03PS1) 10Ssingh: hiera: announce ns0 IP from bird (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/965187 (https://phabricator.wikimedia.org/T348041)
[15:16:17] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10MatthewVernon)
[15:16:52] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44007/console" [puppet] - 10https://gerrit.wikimedia.org/r/965187 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh)
[15:17:14] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] admin_ng/ml-serve: add namespace permissions for llm [puppet] - 10https://gerrit.wikimedia.org/r/965180 (https://phabricator.wikimedia.org/T348661) (owner: 10Ilias Sarantopoulos)
[15:17:49] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] admin_ng: add llm namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/965181 (https://phabricator.wikimedia.org/T348661) (owner: 10Ilias Sarantopoulos)
[15:18:09] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5001.eqsin.wmnet with OS bookworm
[15:18:18] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm
[15:18:59] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "To be merged tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/965187 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh)
[15:20:13] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:20:43] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: add langid in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507)
[15:20:45] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:21:08] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] trafficserver: correct pageviews paths [puppet] - 10https://gerrit.wikimedia.org/r/965184 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan)
[15:21:36] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:21:41] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] trafficserver: correct pageviews paths [puppet] - 10https://gerrit.wikimedia.org/r/965184 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan)
[15:22:09] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:22:12] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] Always restart parsoid-rt/parsoid-rt-client on failures [puppet] - 10https://gerrit.wikimedia.org/r/965183 (owner: 10Muehlenhoff)
[15:22:51] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:23:02] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:23:09] <wikibugs>	 (03PS3) 10JMeybohm: Add mw-wikifunctions discovery records [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544)
[15:23:11] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:24:35] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: add langid in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos)
[15:24:58] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add mw-wikifunctions discovery records [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[15:25:00] <wikibugs>	 (03PS6) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092
[15:25:04] <vgutierrez>	 !log depool ncredir5001
[15:25:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:40] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye
[15:26:22] <wikibugs>	 (03CR) 10Muehlenhoff: "The earlier version has a logic error; having a ferm::service without $ferm_srange is actually supported and results in a firewall def wit" [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff)
[15:26:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Always restart parsoid-rt/parsoid-rt-client on failures [puppet] - 10https://gerrit.wikimedia.org/r/965183 (owner: 10Muehlenhoff)
[15:27:08] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:27:47] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: APIGW: add entry for llm langid LW isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/965191 (https://phabricator.wikimedia.org/T340507)
[15:27:58] <wikibugs>	 10SRE, 10ops-eqiad, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{1001..1009}.eqiad.wmnet - https://phabricator.wikimedia.org/T348144 (10Jclark-ctr) 05Open→03Resolved
[15:28:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{1001..1009}.eqiad.wmnet - https://phabricator.wikimedia.org/T348144 (10Jclark-ctr) a:03Jclark-ctr
[15:30:17] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff)
[15:32:16] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "Thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/965155 (owner: 10Majavah)
[15:34:23] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:36:03] <wikibugs>	 (03CR) 10Elukey: ml-services: add langid in llm namespace (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos)
[15:38:42] <wikibugs>	 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, and 2 others: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10JMeybohm)
[15:45:07] <wikibugs>	 (03PS1) 10Jclark-ctr: add stat1011 to autoinstall and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/965193 (https://phabricator.wikimedia.org/T342454)
[15:45:59] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] add stat1011 to autoinstall and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/965193 (https://phabricator.wikimedia.org/T342454) (owner: 10Jclark-ctr)
[15:52:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye
[15:52:37] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1011.eqiad.wmnet with OS bullseye
[15:52:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye
[15:52:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye executed with errors:...
[15:53:36] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache mw-wikifunctions.discovery.wmnet on eqiad recursors
[15:53:36] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw-wikifunctions.discovery.wmnet on eqiad recursors
[15:53:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice! Thank you for being mindful of extra labels/metrics" [puppet] - 10https://gerrit.wikimedia.org/r/965178 (https://phabricator.wikimedia.org/T348456) (owner: 10Elukey)
[15:53:46] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache mw-wikifunctions.discovery.wmnet on codfw recursors
[15:53:47] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw-wikifunctions.discovery.wmnet on codfw recursors
[15:54:39] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5001.eqsin.wmnet with reason: host reimage
[15:55:20] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "👍" [alerts] - 10https://gerrit.wikimedia.org/r/965154 (owner: 10Majavah)
[15:55:31] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] team-wmcs: ceph: cleanup summaries of existing alerts [alerts] - 10https://gerrit.wikimedia.org/r/965154 (owner: 10Majavah)
[15:55:34] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] team-wmcs: ceph: add alert for slow ops [alerts] - 10https://gerrit.wikimedia.org/r/965155 (owner: 10Majavah)
[15:56:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] team-wmcs: ceph: cleanup summaries of existing alerts [alerts] - 10https://gerrit.wikimedia.org/r/965154 (owner: 10Majavah)
[15:56:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] team-wmcs: ceph: add alert for slow ops [alerts] - 10https://gerrit.wikimedia.org/r/965155 (owner: 10Majavah)
[15:57:26] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:41] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5001.eqsin.wmnet with reason: host reimage
[16:04:11] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:05:07] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:05:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:08:08] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: add langid in llm namespace (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos)
[16:10:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:23:03] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:13] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir5001.eqsin.wmnet with OS bookworm
[16:29:23] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm completed: - ncredir5001 (**PASS**)   - Removed from Pup...
[16:31:57] <wikibugs>	 (03PS1) 10Majavah: Don't double-escape link contents [extensions/GlobalBlocking] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965207 (https://phabricator.wikimedia.org/T348669)
[16:33:03] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:10] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] Don't double-escape link contents [extensions/GlobalBlocking] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965207 (https://phabricator.wikimedia.org/T348669) (owner: 10Majavah)
[16:33:19] <taavi>	 jouncebot: nowandnext
[16:33:19] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 26 minute(s)
[16:33:19] <jouncebot>	 In 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1700)
[16:33:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965207 (https://phabricator.wikimedia.org/T348669) (owner: 10Majavah)
[16:35:48] <wikibugs>	 (03Merged) 10jenkins-bot: Don't double-escape link contents [extensions/GlobalBlocking] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965207 (https://phabricator.wikimedia.org/T348669) (owner: 10Majavah)
[16:36:17] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:965207|Don't double-escape link contents (T348669)]]
[16:36:21] <stashbot>	 T348669: GlobalBlocking navigation bar is double-escaped - https://phabricator.wikimedia.org/T348669
[16:37:23] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (10thcipriani) Home directory was cleaned up. Removing our team tag since immediate problem was isolated, and SRE maintain the puppet-diff project. Ping if I missed anything!
[16:37:41] <logmsgbot>	 !log taavi@deploy2002 taavi: Backport for [[gerrit:965207|Don't double-escape link contents (T348669)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:38:07] <logmsgbot>	 !log taavi@deploy2002 taavi: Continuing with sync
[16:39:30] <wikibugs>	 (03PS1) 10DLynch: Remove override to allow mobile edit notices to display on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965205 (https://phabricator.wikimedia.org/T316178)
[16:40:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (10ssingh) 05Open→03Resolved a:03ssingh Marking this as resolved as the person who initially reported this. Thanks for the help everyone!
[16:41:42] <wikibugs>	 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10M2k_dewiki)
[16:42:44] <wikibugs>	 (03PS3) 10Jforrester: wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - 10https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388)
[16:42:48] <wikibugs>	 (03PS3) 10Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388)
[16:42:49] <wikibugs>	 (03PS3) 10Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388)
[16:43:53] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:965207|Don't double-escape link contents (T348669)]] (duration: 07m 35s)
[16:43:56] * taavi done
[16:44:11] <stashbot>	 T348669: GlobalBlocking navigation bar is double-escaped - https://phabricator.wikimedia.org/T348669
[16:44:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye
[16:44:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye
[16:44:45] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1011.eqiad.wmnet with OS bullseye
[16:44:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye executed wit...
[16:46:07] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1011']
[16:47:35] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['stat1011']
[16:48:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye
[16:48:59] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1011.eqiad.wmnet with OS bullseye
[16:48:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye
[16:49:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye executed wit...
[16:50:16] <wikibugs>	 (03PS4) 10Jforrester: wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - 10https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388)
[16:50:18] <wikibugs>	 (03PS4) 10Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388)
[16:50:20] <wikibugs>	 (03PS4) 10Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388)
[16:50:22] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Simplify releases/environments config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965226
[16:51:58] <wikibugs>	 (03PS2) 10Ryan Kemper: rdf-streaming-updater: restrict space usage alert from 1TiB to 50GiB [alerts] - 10https://gerrit.wikimedia.org/r/964934 (owner: 10DCausse)
[16:52:58] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[16:53:19] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye
[16:53:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye
[16:55:19] <James_F>	 jouncebot: nowandnext
[16:55:19] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 4 minute(s)
[16:55:19] <jouncebot>	 In 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1700)
[16:55:24] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] wikifunctions: Simplify releases/environments config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965226 (owner: 10Jforrester)
[16:55:26] <wikibugs>	 (03PS1) 10JMeybohm: Add appserver, api and jobrunner SANs to mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/965227 (https://phabricator.wikimedia.org/T347544)
[16:56:09] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:11] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Simplify releases/environments config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965226 (owner: 10Jforrester)
[16:57:07] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:57:12] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:58:38] <wikibugs>	 (03CR) 10JMeybohm: "Not sure if it's worth it to separate this by mw release. WDYT?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965227 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1700)
[17:00:29] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:19] <wikibugs>	 (03PS2) 10JMeybohm: Add appserver, api and jobrunner SANs to mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/965227 (https://phabricator.wikimedia.org/T347544)
[17:03:41] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3004.esams.wmnet with OS bookworm
[17:03:51] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir3004.esams.wmnet with OS bookworm
[17:05:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:10:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:12:05] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:14:22] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:15:37] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:16:25] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:21:43] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:22:16] <sukhe>	 er?
[17:24:09] <sukhe>	 2001:504:61:0:6:1374:0:1, ipv6.de-cix.dfw.us.as398196.cobaltridge.com. hm ok
[17:24:41] <taavi>	 hm what's up with the git_pull_charts alert?
[17:26:11] <taavi>	 hnowlan: there seem to be some local changes in /srv/deployment-charts blocking git pulls, and you have related-looking SAL entries, can you fix those?
[17:26:50] <hnowlan>	 taavi: agh, fixing
[17:27:44] <hnowlan>	 taavi: done, thanks for the heads-up
[17:27:57] <sukhe>	 !log repool cp2030 for service=cdn
[17:27:59] <wikibugs>	 (03CR) 10Bking: [C: 03+2] admin: Add cirrus-streaming-updater namespace to flink operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/964567 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[17:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:02] <wikibugs>	 (03PS1) 10Majavah: helpfile: Cleanup chart pull timer [puppet] - 10https://gerrit.wikimedia.org/r/965229
[17:28:07] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage
[17:28:07] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:33] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Four files repeatedly failed to upload today...
[17:30:37] <taavi>	 hnowlan: thanks, although now it looks like `helmfile` is showing some diffs between the applied state and the values file
[17:32:09] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage
[17:35:36] <wikibugs>	 10SRE, 10DNS, 10Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10Lhiraide) Hi @NMariano-WMF that would be great! Thank you all so much for your help!
[17:36:29] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10Don-vip) The problem is worse today, I have now 5 files that have not been uploaded: - https://co...
[17:41:59] <wikibugs>	 10SRE, 10DNS, 10Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10NMariano-WMF) Hi @Lhiraide and @ssingh, I sent out an invite tomorrow. I don't think we'll need the full time for the meeting, but wanted to be safe just in case we did. Let me know if that time doesn't...
[17:46:56] <James_F>	 jouncebot: nowandnext
[17:46:56] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1700)
[17:46:57] <jouncebot>	 In 0 hour(s) and 13 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1800)
[17:46:57] <jouncebot>	 In 0 hour(s) and 13 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1800)
[17:47:00] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[17:47:04] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[17:47:09] <James_F>	 OK, good.
[17:49:22] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:49:59] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:50:32] <TheresNoTime>	 is logstash working okay for everyone? Searches aren't returning any results, and there's a lot of "Could not index event to OpenSearch. status: 400"
[17:51:17] <James_F>	 TheresNoTime: I just loaded the MW-NEW-errors dash OK. Is it a specific dash that's broken? Or a timerange?
[17:52:28] <TheresNoTime>	 oh wait one..
[17:52:47] <MatmaRex>	 TheresNoTime: if you see "dlq-*" in top-left corner, change it to "logstash-*"
[17:52:50] <TheresNoTime>	 huh, okay, false alarm — the "index pattern" has changed..
[17:52:53] <TheresNoTime>	 yeah
[17:52:57] <James_F>	 Aha.
[17:53:04] <MatmaRex>	 i was also confused by that a few days ago
[17:53:24] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - 10https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester)
[17:54:21] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - 10https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester)
[17:54:59] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:55:40] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[17:55:49] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3004.esams.wmnet with OS bookworm
[17:55:59] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir3004.esams.wmnet with OS bookworm completed: - ncredir3004 (**WARN**)   - Downtimed on Ici...
[17:56:02] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[17:56:13] <wikibugs>	 SRE, DNS, Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (ssingh) @NMariano-WMF: Thanks, accepted!
[17:56:31] <wikibugs>	 (PS5) Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388)
[17:56:33] <wikibugs>	 (PS5) Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388)
[17:59:43] <wikibugs>	 (PS1) RLazarus: Revert "admin: Temporarily add a second ssh key for rzl" [puppet] - https://gerrit.wikimedia.org/r/965209
[18:00:04] <jouncebot>	 hashar and jeena: May I have your attention please! Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1800)
[18:00:04] <jouncebot>	 hashar and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1800). nyaa~
[18:00:48] <wikibugs>	 (CR) RLazarus: [C: +2] Revert "admin: Temporarily add a second ssh key for rzl" [puppet] - https://gerrit.wikimedia.org/r/965209 (owner: RLazarus)
[18:01:01] <wikibugs>	 (PS1) Jforrester: wikifunctions: Define different ports for different service releases [deployment-charts] - https://gerrit.wikimedia.org/r/965234 (https://phabricator.wikimedia.org/T343388)
[18:02:34] <wikibugs>	 (CR) Jforrester: [C: +2] wikifunctions: Define different ports for different service releases [deployment-charts] - https://gerrit.wikimedia.org/r/965234 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester)
[18:03:23] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Define different ports for different service releases [deployment-charts] - https://gerrit.wikimedia.org/r/965234 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester)
[18:04:44] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[18:05:26] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[18:06:32] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[18:07:06] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[18:07:38] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[18:07:42] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[18:08:10] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[18:14:02] <wikibugs>	 (PS6) Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388)
[18:14:04] <wikibugs>	 (PS6) Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388)
[18:14:06] <wikibugs>	 (PS1) Jforrester: wikifunctions: Move orchestrator config from chart to service values [deployment-charts] - https://gerrit.wikimedia.org/r/965237 (https://phabricator.wikimedia.org/T343388)
[18:15:28] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:17:11] <wikibugs>	 (CR) Jforrester: [C: +2] wikifunctions: Move orchestrator config from chart to service values [deployment-charts] - https://gerrit.wikimedia.org/r/965237 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester)
[18:17:48] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[18:18:16] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Move orchestrator config from chart to service values [deployment-charts] - https://gerrit.wikimedia.org/r/965237 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester)
[18:18:47] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3003.esams.wmnet with OS bookworm
[18:18:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52910 and previous config saved to /var/cache/conftool/dbconfig/20231011-181849-arnaudb.json
[18:18:53] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[18:18:58] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir3003.esams.wmnet with OS bookworm
[18:19:08] <wikibugs>	 (PS1) Jforrester: specials: Use correct title in NewPagesPager [core] (wmf/1.41.0-wmf.30) - https://gerrit.wikimedia.org/r/965211 (https://phabricator.wikimedia.org/T348665)
[18:19:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stat1011.eqiad.wmnet with reason: host reimage
[18:21:06] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[18:21:50] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[18:22:08] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:22:27] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[18:23:03] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat1011.eqiad.wmnet with reason: host reimage
[18:23:17] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[18:23:21] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[18:24:10] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[18:24:46] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[18:25:00] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:27:19] <wikibugs>	 (CR) Jforrester: [C: +2] wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester)
[18:28:13] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester)
[18:28:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:31:02] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[18:31:33] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[18:32:06] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[18:33:01] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[18:33:05] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[18:33:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:33:53] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[18:33:56] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P52911 and previous config saved to /var/cache/conftool/dbconfig/20231011-183355-arnaudb.json
[18:34:32] <wikibugs>	 (CR) Jforrester: [C: +2] wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester)
[18:35:22] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester)
[18:35:46] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[18:35:49] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[18:36:01] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[18:36:04] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[18:36:07] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[18:36:09] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[18:43:06] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage
[18:43:29] <wikibugs>	 (PS1) Jforrester: wikifunctions: Rev charts to 0.2.0, move TODOs around for clarity [deployment-charts] - https://gerrit.wikimedia.org/r/965239
[18:45:00] <wikibugs>	 SRE, Cloud-VPS, Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (M2k_dewiki)
[18:46:15] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage
[18:46:57] <wikibugs>	 (CR) Jforrester: [C: -1] "This is a much bigger diff than expected! To investigate." [deployment-charts] - https://gerrit.wikimedia.org/r/965239 (owner: Jforrester)
[18:47:40] <wikibugs>	 SRE, Infrastructure-Foundations, netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (cmooney) I lab tested this and the "always-compare-med" command works as expected (see P52912).  >>! In T348446#9238640, @ayounsi wrote: > Some of our...
[18:48:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[18:49:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P52913 and previous config saved to /var/cache/conftool/dbconfig/20231011-184902-arnaudb.json
[18:49:44] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[18:49:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stat1011.eqiad.wmnet with OS bullseye
[18:49:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye completed: -...
[18:53:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr)
[18:53:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr) Open→Resolved
[19:04:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52914 and previous config saved to /var/cache/conftool/dbconfig/20231011-190408-arnaudb.json
[19:04:18] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[19:08:13] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1101']
[19:10:12] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3003.esams.wmnet with OS bookworm
[19:10:23] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir3003.esams.wmnet with OS bookworm completed: - ncredir3003 (**PASS**)   - Downtimed on Ici...
[19:12:03] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[19:12:20] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir2002.codfw.wmnet with OS bookworm
[19:12:29] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir2002.codfw.wmnet with OS bookworm
[19:14:15] <wikibugs>	 (CR) Ebernhardson: [C: +2] rdf-streaming-updater: restrict space usage alert from 1TiB to 50GiB [alerts] - https://gerrit.wikimedia.org/r/964934 (owner: DCausse)
[19:15:29] <wikibugs>	 (Merged) jenkins-bot: rdf-streaming-updater: restrict space usage alert from 1TiB to 50GiB [alerts] - https://gerrit.wikimedia.org/r/964934 (owner: DCausse)
[19:23:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:27:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (VRiley-WMF)
[19:32:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (VRiley-WMF)
[19:37:10] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir2002.codfw.wmnet with reason: host reimage
[19:40:06] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir2002.codfw.wmnet with reason: host reimage
[19:43:26] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (Don-vip) Another one: - https://commons.wikimedia.org/wiki/File:2016GHRCUWGMeeting_(29211169404)....
[19:44:14] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1102.mgmt.eqiad.wmnet with reboot policy FORCED
[19:49:22] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:52:20] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1103.mgmt.eqiad.wmnet with reboot policy FORCED
[19:54:38] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir2002.codfw.wmnet with OS bookworm
[19:54:49] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir2002.codfw.wmnet with OS bookworm completed: - ncredir2002 (**WARN**)   - Downtimed on Ici...
[19:55:20] <wikibugs>	 (PS2) Samtar: Enable Edit Check on initial partner wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/963084 (https://phabricator.wikimedia.org/T347908) (owner: DLynch)
[19:56:47] <Kizule>	 Hi, is there some maintenance or something like that? Commons is throwing me "Failed to commit operations" when I'm using FileImporter for moving files from Serbian Wikipedia to Commons.
[19:57:58] <Kizule>	 So far, the second file wasn't completely imported because of that error, so I had to ask on #wikimedia-commons for deletion. The files don't have many revisions, just about 5-6 revisions.
[19:58:10] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:58:11] <Kizule>	 And 3 previous versions of images.
[20:00:01] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1104']
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T2000)
[20:00:05] <jouncebot>	 kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1104']
[20:00:20] <Kemayo>	 👋🏻
[20:00:27] <TheresNoTime>	 I can deploy :)
[20:00:37] <TheresNoTime>	 Kizule: could you log a task?
[20:00:52] <TheresNoTime>	 Kemayo: starting with 963084
[20:00:56] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/963084 (https://phabricator.wikimedia.org/T347908) (owner: DLynch)
[20:01:30] <Kemayo>	 TheresNoTime: Sounds good
[20:01:56] <wikibugs>	 (Merged) jenkins-bot: Enable Edit Check on initial partner wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/963084 (https://phabricator.wikimedia.org/T347908) (owner: DLynch)
[20:02:25] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:963084|Enable Edit Check on initial partner wikis (T347908)]]
[20:02:39] <stashbot>	 T347908: [Config] Enable Edit Check (References) at initial partner wikis - https://phabricator.wikimedia.org/T347908
[20:02:50] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:02:52] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[20:03:16] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[20:03:20] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[20:03:47] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[20:03:49] <logmsgbot>	 !log samtar@deploy2002 samtar and kemayo: Backport for [[gerrit:963084|Enable Edit Check on initial partner wikis (T347908)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:03:51] <TheresNoTime>	 Kizule: I did a little bit of digging, maybe T348688
[20:03:58] <TheresNoTime>	 Kemayo: live on mwdebug, can you test? :)
[20:04:01] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:04:04] <stashbot>	 T348688: FileBackendStore::ingestFreshFileStats: Could not stat file - https://phabricator.wikimedia.org/T348688
[20:04:08] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:04:24] <Kemayo>	 TheresNoTime: It seems to be working fine, thanks!
[20:04:30] <logmsgbot>	 !log samtar@deploy2002 samtar and kemayo: Continuing with sync
[20:04:45] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir2001.codfw.wmnet with OS bookworm
[20:04:47] <logmsgbot>	 !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[20:04:53] <logmsgbot>	 !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[20:04:59] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir2001.codfw.wmnet with OS bookworm
[20:05:14] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[20:05:20] <wikibugs>	 (PS2) Samtar: Remove override to allow mobile edit notices to display on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/965205 (https://phabricator.wikimedia.org/T316178) (owner: DLynch)
[20:07:35] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[20:07:38] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:08:00] <wikibugs>	 (PS1) Jdlrobson: Beta cluster: mobile web click tracking schema at 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/965246 (https://phabricator.wikimedia.org/T346106)
[20:09:58] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:963084|Enable Edit Check on initial partner wikis (T347908)]] (duration: 07m 32s)
[20:10:05] <stashbot>	 T347908: [Config] Enable Edit Check (References) at initial partner wikis - https://phabricator.wikimedia.org/T347908
[20:10:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/965205 (https://phabricator.wikimedia.org/T316178) (owner: DLynch)
[20:11:14] <wikibugs>	 (Merged) jenkins-bot: Remove override to allow mobile edit notices to display on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/965205 (https://phabricator.wikimedia.org/T316178) (owner: DLynch)
[20:11:26] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye
[20:11:27] <wikibugs>	 (PS3) Samtar: InitialiseSettings-labs: Enable UrlShortenerEnableQrCode on all of beta [mediawiki-config] - https://gerrit.wikimedia.org/r/965240 (https://phabricator.wikimedia.org/T348487)
[20:11:33] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye
[20:11:39] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:965205|Remove override to allow mobile edit notices to display on all wikis (T316178)]]
[20:11:44] <stashbot>	 T316178: [Config Change] Make upstream mobile edit notice implementation available at all wikis - https://phabricator.wikimedia.org/T316178
[20:12:11] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye
[20:12:17] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with e...
[20:13:00] <logmsgbot>	 !log samtar@deploy2002 kemayo and samtar: Backport for [[gerrit:965205|Remove override to allow mobile edit notices to display on all wikis (T316178)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:13:04] <TheresNoTime>	 Kemayo: second patch live on mwdebug
[20:13:24] <Kemayo>	 Checking now.
[20:13:44] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:13:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye
[20:13:55] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye
[20:14:27] <Kemayo>	 TheresNoTime: Okay, looks good to deploy.
[20:14:33] <logmsgbot>	 !log samtar@deploy2002 kemayo and samtar: Continuing with sync
[20:16:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:19:18] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:19:57] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:965205|Remove override to allow mobile edit notices to display on all wikis (T316178)]] (duration: 08m 18s)
[20:20:02] <TheresNoTime>	 Kemayo: both live in prod :)
[20:20:02] <stashbot>	 T316178: [Config Change] Make upstream mobile edit notice implementation available at all wikis - https://phabricator.wikimedia.org/T316178
[20:20:09] <Kemayo>	 TheresNoTime: great, thanks!
[20:21:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:21:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/965240 (https://phabricator.wikimedia.org/T348487) (owner: Samtar)
[20:22:04] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir2001.codfw.wmnet with reason: host reimage
[20:22:33] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings-labs: Enable UrlShortenerEnableQrCode on all of beta [mediawiki-config] - https://gerrit.wikimedia.org/r/965240 (https://phabricator.wikimedia.org/T348487) (owner: Samtar)
[20:24:21] <wikibugs>	 (PS1) Bking: flink-zk: Permit traffic from STAGING_KUBEPODS_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/965248 (https://phabricator.wikimedia.org/T347075)
[20:24:48] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir2001.codfw.wmnet with reason: host reimage
[20:25:03] <wikibugs>	 (CR) Ebernhardson: [C: +1] flink-zk: Permit traffic from STAGING_KUBEPODS_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/965248 (https://phabricator.wikimedia.org/T347075) (owner: Bking)
[20:26:34] <wikibugs>	 (CR) Bking: [C: +2] flink-zk: Permit traffic from STAGING_KUBEPODS_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/965248 (https://phabricator.wikimedia.org/T347075) (owner: Bking)
[20:36:46] <wikibugs>	 (PS1) Ebernhardson: cirrus-streaming-updater: Correctly define the entry class [deployment-charts] - https://gerrit.wikimedia.org/r/965249 (https://phabricator.wikimedia.org/T347075)
[20:38:16] <wikibugs>	 (CR) Bking: [C: +2] cirrus-streaming-updater: Correctly define the entry class [deployment-charts] - https://gerrit.wikimedia.org/r/965249 (https://phabricator.wikimedia.org/T347075) (owner: Ebernhardson)
[20:39:31] <wikibugs>	 (CR) Bking: [C: +2] cirrus-streaming-updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/964928 (owner: Ebernhardson)
[20:39:41] <icinga-wm>	 RECOVERY - MD RAID on ganeti1022 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:40:06] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir2001.codfw.wmnet with OS bookworm
[20:40:16] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir2001.codfw.wmnet with OS bookworm completed: - ncredir2001 (**WARN**)   - Downtimed on Ici...
[20:40:20] <wikibugs>	 (Merged) jenkins-bot: cirrus-streaming-updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/964928 (owner: Ebernhardson)
[20:40:22] <wikibugs>	 (Merged) jenkins-bot: cirrus-streaming-updater: Correctly define the entry class [deployment-charts] - https://gerrit.wikimedia.org/r/965249 (https://phabricator.wikimedia.org/T347075) (owner: Ebernhardson)
[20:41:07] <Kizule>	 TheresNoTime: Sorry for not responding earlier. Can you check Logstash for "Aquaman and the Lost Kingdom logo.jpg" on Serbian Wikipedia? I just tried to delete it and got a generic error saying that deletion isn't possible because of local-swift-eqiad.
[20:43:42] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[20:44:06] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:45:40] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[20:45:48] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:49:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:53:47] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[20:54:08] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1002.eqiad.wmnet with OS bookworm
[20:54:19] <taavi>	 jouncebot: nowandnext
[20:54:19] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T2000)
[20:54:19] <jouncebot>	 In 0 hour(s) and 5 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T2100)
[20:54:19] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir1002.eqiad.wmnet with OS bookworm
[20:54:33] <wikibugs>	 (03PS1) 10Majavah: Set WRITE_NEW for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965250 (https://phabricator.wikimedia.org/T242031)
[20:54:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965250 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah)
[20:55:30] <wikibugs>	 (03Merged) 10jenkins-bot: Set WRITE_NEW for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965250 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah)
[20:55:53] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:965250|Set WRITE_NEW for CA wikis on OATHAuth multiple devices (T242031)]]
[20:55:58] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[20:57:13] <logmsgbot>	 !log taavi@deploy2002 taavi: Backport for [[gerrit:965250|Set WRITE_NEW for CA wikis on OATHAuth multiple devices (T242031)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T2100)
[21:01:05] <logmsgbot>	 !log taavi@deploy2002 taavi: Continuing with sync
[21:04:22] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:04:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:06:27] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:965250|Set WRITE_NEW for CA wikis on OATHAuth multiple devices (T242031)]] (duration: 10m 33s)
[21:06:38] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[21:07:17] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage
[21:09:55] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage
[21:11:12] <ryankemper>	 !log T348418 Rebooting `apifeatureusage1001.eqiad.wmnet`
[21:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:16] <stashbot>	 T348418: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418
[21:15:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ifup@ens13.service Failed on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:16:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:29] <ryankemper>	 ^ Had set a downtime on icinga but not alertmanager. The apifeatureusage1001 alert should resolve soon with the host back online
[21:20:42] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on apifeatureusage2001.codfw.wmnet with reason: reboot T348418
[21:20:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) ifup@ens13.service Failed on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:20:44] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on apifeatureusage2001.codfw.wmnet with reason: reboot T348418
[21:20:46] <stashbot>	 T348418: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418
[21:23:35] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:26:04] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir1002.eqiad.wmnet with OS bookworm
[21:26:13] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir1002.eqiad.wmnet with OS bookworm completed: - ncredir1002 (**WARN**)   - Downtimed on Ici...
[21:30:45] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1001.eqiad.wmnet with OS bookworm
[21:30:50] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.remove-downtime for apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet
[21:30:50] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet
[21:30:57] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir1001.eqiad.wmnet with OS bookworm
[21:38:52] <wikibugs>	 10SRE, 10Growth-Team, 10MW-on-K8s, 10MediaWiki-Platform-Team, and 5 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (10KStoller-WMF)
[21:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[21:43:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:47:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage
[21:48:59] <wikibugs>	 (03PS3) 10Bking: wdqs: bring graph split hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505)
[21:49:01] <wikibugs>	 (03PS13) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[21:49:37] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage
[21:51:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking)
[21:58:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:02:08] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus-streaming-updater: Enable s3 for state storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/965256 (https://phabricator.wikimedia.org/T347075)
[22:02:50] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus-streaming-updater: Enable s3 for state storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/965256 (https://phabricator.wikimedia.org/T347075)
[22:05:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:05:21] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir1001.eqiad.wmnet with OS bookworm
[22:05:32] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir1001.eqiad.wmnet with OS bookworm completed: - ncredir1001 (**WARN**)   - Downtimed on Ici...
[22:06:12] <wikibugs>	 (03PS14) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[22:06:22] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[22:06:32] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[22:06:58] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking)
[22:08:28] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus-streaming-updater: Enable s3 for state storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/965256 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[22:09:11] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: Enable s3 for state storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/965256 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[22:10:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[22:12:23] <wikibugs>	 (03PS15) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[22:12:45] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking)
[22:13:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:15:31] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[22:15:42] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:18:02] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[22:18:12] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:18:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:46:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye
[22:47:02] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye
[23:05:41] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-be2003.codfw.wmnet with OS bullseye
[23:05:48] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with e...
[23:05:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Papaul) @MatthewVernon sorry to hear that you are having some issue with this server.  I was able to set all the disks as JBOD like you asked. However...
[23:06:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[23:09:05] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye
[23:09:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[23:22:25] <logmsgbot>	 !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye
[23:23:47] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye
[23:23:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye
[23:41:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:46:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:59:15] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state