[00:34:52] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[00:38:53] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/964634
[00:38:56] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/964634 (owner: TrainBranchBot)
[00:46:12] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:12] (PS5) Cathal Mooney: Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583)
[00:53:40] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/964634 (owner: TrainBranchBot)
[00:54:58] SRE, Math, RESTBase-API, Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (matmarex) For what it's worth, there's plenty of error logging indicating that Math has trouble contacting RESTBase: https://l...
[02:01:02] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1104 [02:02:23] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1104 [02:03:24] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1104.mgmt.eqiad.wmnet with reboot policy FORCED [02:15:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T343198)', diff saved to https://phabricator.wikimedia.org/P52892 and previous config saved to /var/cache/conftool/dbconfig/20231011-021513-arnaudb.json [02:15:18] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [02:18:30] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:18:51] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1104.mgmt.eqiad.wmnet with reboot policy FORCED [02:27:46] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [02:27:54] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [02:30:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2148', diff saved to https://phabricator.wikimedia.org/P52893 and previous config saved to /var/cache/conftool/dbconfig/20231011-023019-arnaudb.json [02:38:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:44:32] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P52894 and previous config saved to /var/cache/conftool/dbconfig/20231011-024526-arnaudb.json [02:49:26] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [02:49:34] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [03:00:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T343198)', diff saved 
to https://phabricator.wikimedia.org/P52895 and previous config saved to /var/cache/conftool/dbconfig/20231011-030032-arnaudb.json [03:00:35] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [03:00:42] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [03:00:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [03:00:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T343198)', diff saved to https://phabricator.wikimedia.org/P52896 and previous config saved to /var/cache/conftool/dbconfig/20231011-030054-arnaudb.json [03:03:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:33:32] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:36:26] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:36:54] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:36:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 
0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:00:41] (03PS2) 10KartikMistry: Update cxserver to 2023-10-11-045323-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/964846 (https://phabricator.wikimedia.org/T341478) [05:07:51] * kart_ deploying cxserver.. [05:08:10] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-10-11-045323-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/964846 (https://phabricator.wikimedia.org/T341478) (owner: 10KartikMistry) [05:09:02] (03Merged) 10jenkins-bot: Update cxserver to 2023-10-11-045323-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/964846 (https://phabricator.wikimedia.org/T341478) (owner: 10KartikMistry) [05:10:41] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:11:03] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:18:55] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:19:29] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:21:04] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:21:33] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:23:54] !log Updated cxserver to 2023-10-11-045323-production (T341478, T344982, T338432, T347939) [05:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:03] T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939 [05:24:03] T344982: Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982 [05:24:03] 
T341478: Port the markup transfer feature of cxserver to MinT - https://phabricator.wikimedia.org/T341478 [05:24:04] T338432: Prepare the cxserver for usage without RESTbase - https://phabricator.wikimedia.org/T338432 [05:25:13] Looks like cxserver is down. Checking. [05:39:42] (03PS1) 10KartikMistry: cxserver: Fix restbase path [deployment-charts] - 10https://gerrit.wikimedia.org/r/965019 [05:40:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:40:44] (03PS1) 10KartikMistry: Revert "Update cxserver to 2023-10-11-045323-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/964603 [05:41:05] (03Abandoned) 10KartikMistry: cxserver: Fix restbase path [deployment-charts] - 10https://gerrit.wikimedia.org/r/965019 (owner: 10KartikMistry) [05:42:41] (03CR) 10KartikMistry: [C: 03+2] Revert "Update cxserver to 2023-10-11-045323-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/964603 (owner: 10KartikMistry) [05:43:24] (03Merged) 10jenkins-bot: Revert "Update cxserver to 2023-10-11-045323-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/964603 (owner: 10KartikMistry) [05:44:16] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:44:32] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:45:04] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:45:38] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:45:55] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:46:17] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:50:37] 
(LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:51:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:55:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:58:37] (03PS1) 10KartikMistry: Update cxserver to 2023-10-11-045323-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T341478) [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T0600) [06:11:23] (03PS1) 10Slyngshede: P:monitoring remove remnants of dpkg monitoring [puppet] - 10https://gerrit.wikimedia.org/r/965024 (https://phabricator.wikimedia.org/T332764) [06:37:01] (03CR) 10Elukey: team-ml: add alert for Kafka consumer lag for ores extension (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias 
Sarantopoulos) [06:38:14] (03CR) 10Elukey: team-ml: add alert for memory spike in inf services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [06:38:33] (03CR) 10Elukey: [C: 03+1] cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [06:43:10] (03CR) 10Elukey: "Forgot also one thing - we can add test fixtures, see in other directories how it is done (basically you add a _test.yaml file etc..)." [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [06:43:16] (03CR) 10Elukey: "Forgot also one thing - we can add test fixtures, see in other directories how it is done (basically you add a _test.yaml file etc..)." [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [06:45:02] (03CR) 10Elukey: [C: 03+1] ml-services: test kserve batcher for revertrisk-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/964915 (https://phabricator.wikimedia.org/T348536) (owner: 10AikoChou) [06:51:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [06:53:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:55:54] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:00] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:59:56] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:00:05] Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T0700). Please do the needful. [07:00:05] sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:34] hi [07:01:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:03:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:03:33] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:06:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:11:17] jouncebot: now [07:11:17] For the next 0 hour(s) and 48 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T0700) [07:11:39] sergi0: good morning, I guess I will do the deployments :] [07:11:59] hashar: I was about to start myself, as you wish :) [07:13:05] *good morning :) [07:13:24] oh if you know how to deploy please go ahead! [07:13:36] (03CR) 10Jelto: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/964523 (owner: 10EoghanGaffney) [07:13:38] (03CR) 10Elukey: "Left a comment about build vs runtime OS. Another qs - have you tried to run docker-pkg locally to build the new image? To verify errors e" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: 10Kamila Součková) [07:13:44] sure, starting [07:14:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964929 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno) [07:14:03] I have added a patch to this window to unblock the mediawiki train which I will run in ~ 45 minutes ( https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/964600/ ) [07:14:11] and will deploy it once you are done ;) [07:14:38] alright [07:14:41] (03Merged) 10jenkins-bot: GrowthExperiments: enable AddLink frontend 14th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964929 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno) [07:15:43] !log sgimeno@deploy2002 Started scap: Backport for [[gerrit:964929|GrowthExperiments: enable AddLink frontend 14th round of wikis (T308139)]] 
[07:15:48] T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 [07:17:10] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:964929|GrowthExperiments: enable AddLink frontend 14th round of wikis (T308139)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:19:00] !log sgimeno@deploy2002 sgimeno: Continuing with sync [07:24:48] !log sgimeno@deploy2002 Finished scap: Backport for [[gerrit:964929|GrowthExperiments: enable AddLink frontend 14th round of wikis (T308139)]] (duration: 09m 05s) [07:24:52] T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 [07:25:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964949 (https://phabricator.wikimedia.org/T308141) (owner: 10Sergio Gimeno) [07:25:42] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend 15th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964949 (https://phabricator.wikimedia.org/T308141) [07:26:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964949 (https://phabricator.wikimedia.org/T308141) (owner: 10Sergio Gimeno) [07:26:54] (03Merged) 10jenkins-bot: GrowthExperiments: enable AddLink backend 15th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964949 (https://phabricator.wikimedia.org/T308141) (owner: 10Sergio Gimeno) [07:27:16] !log sgimeno@deploy2002 Started scap: Backport for [[gerrit:964949|GrowthExperiments: enable AddLink backend 15th round of wikis (T308141)]] [07:27:20] T308141: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 [07:27:54] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: add listener for mw-api in the rec-api-ng's config (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475) (owner: 10Elukey) [07:28:35] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:964949|GrowthExperiments: enable AddLink backend 15th round of wikis (T308141)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:29:19] !log sgimeno@deploy2002 sgimeno: Continuing with sync [07:29:29] (03CR) 10Elukey: [C: 03+2] ml-services: add listener for mw-api in the rec-api-ng's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475) (owner: 10Elukey) [07:32:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [07:35:01] !log sgimeno@deploy2002 Finished scap: Backport for [[gerrit:964949|GrowthExperiments: enable AddLink backend 15th round of wikis (T308141)]] (duration: 07m 45s) [07:35:07] T308141: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 [07:35:14] (03CR) 10Hashar: "I have commented on the task ( T340788#8991308 ) that the httpb tests should probably exercise the whole stack (ATS/Varnish caches > Envoy" [puppet] - 10https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) (owner: 10EoghanGaffney) [07:35:33] hashar: finished my patches [07:35:37] great :) [07:35:45] (03CR) 10Hashar: [C: 03+2] Move @font-size-base into mediawiki.skin.variables.less [skins/Vector] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964601 (https://phabricator.wikimedia.org/T348572) (owner: 10Jdlrobson) [07:35:46] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[07:35:53] (03CR) 10Hashar: [C: 03+2] Fixes Echo skin style for user message bar [skins/Vector] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964600 (https://phabricator.wikimedia.org/T348530) (owner: 10Jdlrobson) [07:36:02] I am doing the backports for Vector [07:36:14] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10MoritzMuehlenhoff) [07:37:03] (03PS1) 10Muehlenhoff: Add aqu to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/965049 (https://phabricator.wikimedia.org/T347296) [07:37:17] (03CR) 10CI reject: [V: 04-1] Add aqu to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/965049 (https://phabricator.wikimedia.org/T347296) (owner: 10Muehlenhoff) [07:38:36] (03PS2) 10Muehlenhoff: Add aqu to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/965049 (https://phabricator.wikimedia.org/T347296) [07:39:20] (03CR) 10DCausse: [C: 03+1] cirrus-streaming-updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/964928 (owner: 10Ebernhardson) [07:39:39] (03CR) 10DCausse: admin: Add cirrus-streaming-updater namespace to flink operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/964567 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [07:44:30] (03CR) 10Muehlenhoff: [C: 03+2] Add aqu to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/965049 (https://phabricator.wikimedia.org/T347296) (owner: 10Muehlenhoff) [07:45:19] 10SRE, 10MW-on-K8s, 10MediaWiki-Platform-Team, 10MediaWiki-extensions-CentralAuth, and 5 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (10Clement_Goubert) [07:50:36] (03Merged) 10jenkins-bot: Move @font-size-base into 
mediawiki.skin.variables.less [skins/Vector] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964601 (https://phabricator.wikimedia.org/T348572) (owner: 10Jdlrobson) [07:50:44] (03Merged) 10jenkins-bot: Fixes Echo skin style for user message bar [skins/Vector] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964600 (https://phabricator.wikimedia.org/T348530) (owner: 10Jdlrobson) [07:54:53] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 3 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @Antoine_Quhen I've enabled your access on th... [07:57:28] (03CR) 10JMeybohm: "Sorry for volunteering you Hugh - I might be missing something here." [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T341478) (owner: 10KartikMistry) [08:00:02] !log hashar@deploy2002 Synchronized php-1.41.0-wmf.30/skins/Vector: Backports for Vector styling issues T348572 T348530 (duration: 06m 16s) [08:00:05] hashar and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T0800). 
[08:00:17] T348572: Wrong font size in OOUI dropdowns in Vector - https://phabricator.wikimedia.org/T348572 [08:00:18] T348530: Less_Exception_Compiler: variable @min-width-desktop-wide is undefined in file /srv/mediawiki/php-1.41.0-wmf.30/skins/Vector/skinStyles/ext.echo.styles.alert.less in ext.echo.styles.alert.less on line 22, column 2320 - https://phabricator.wikimedia.org/T348530 [08:08:40] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [08:08:44] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [08:15:50] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [08:15:54] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [08:16:19] 
10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [08:30:38] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965053 (https://phabricator.wikimedia.org/T347081) [08:30:40] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965053 (https://phabricator.wikimedia.org/T347081) (owner: 10TrainBranchBot) [08:31:23] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965053 (https://phabricator.wikimedia.org/T347081) (owner: 10TrainBranchBot) [08:33:15] andre and I are running the MediaWiki train [08:36:09] (03PS1) 10JMeybohm: admin_ng: Add namespace for wikifunctions mediawiki deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) [08:36:11] (03PS1) 10JMeybohm: Add mediawiki deployment for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) [08:38:23] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.30 refs T347081 [08:38:27] T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081 [08:39:15] (03PS1) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) [08:40:04] (03PS1) 10AikoChou: ml-services: upgrade kserve to 0.11.1 for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/965057 (https://phabricator.wikimedia.org/T347550) [08:42:20] (03CR) 10Ayounsi: [C: 03+1] "One additional verification needed then lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [08:44:24] !log hashar@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.30 refs T347081 (duration: 06m 00s) [08:44:27] (03CR) 10Muehlenhoff: idp: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [08:44:27] T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081 [08:44:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [08:45:23] (03CR) 10Elukey: [C: 03+1] ml-services: upgrade kserve to 0.11.1 for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/965057 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) [08:45:28] (03CR) 10Filippo Giunchedi: [C: 03+1] P:monitoring remove remnants of dpkg monitoring [puppet] - 10https://gerrit.wikimedia.org/r/965024 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [08:47:03] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [08:47:44] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Gehel) p:05Triage→03High [08:49:07] (03CR) 10AikoChou: [C: 03+2] ml-services: upgrade kserve to 0.11.1 for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/965057 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) [08:49:09] (03CR) 10Muehlenhoff: [C: 03+2] idp: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [08:50:11] (03Merged) 10jenkins-bot: ml-services: upgrade kserve to 0.11.1 for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/965057 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) 
[08:53:58] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:59:53] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583) (owner: 10Cathal Mooney) [09:02:16] (03CR) 10Ayounsi: "Note that there is a merge conflict with I2764b25d3fc32d9b2ee2ecc5e6115f5a08427fcb I can't rebase this one on top of it." [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [09:02:24] (03CR) 10Ayounsi: [C: 03+1] YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [09:08:02] (03CR) 10Clément Goubert: ml-services: add listener for mw-api in the rec-api-ng's config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475) (owner: 10Elukey) [09:09:00] (03CR) 10Joal: [C: 03+1] "LGTM - I don't know about the inside changes of the image, but the image exists in the registry and I trust the fact that it has been test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/964848 (https://phabricator.wikimedia.org/T343511) (owner: 10Elukey) [09:09:34] (03CR) 10Vgutierrez: wikifunctions: Add routing to separate mw-on-k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert) [09:10:56] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [09:12:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [09:12:59] (03CR) 10Jbond: [C: 03+2] late_command: update puppet installation logic [puppet] - 
10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [09:15:02] (03PS1) 10Joal: Bump mediawiki_history_snapshot to 2023-09 [puppet] - 10https://gerrit.wikimedia.org/r/965059 [09:15:35] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [09:19:04] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [09:19:09] (03PS2) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) [09:19:24] (03CR) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert) [09:23:03] !log jayme@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add VIPs for mw-wikifunction - jayme@cumin1001" [09:23:52] !log jayme@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add VIPs for mw-wikifunction - jayme@cumin1001" [09:23:52] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:25:15] (03CR) 10JMeybohm: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert) [09:27:03] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) This happens again repeatedly with not so bi... 
[09:29:01] (03PS1) 10Jbond: late_command: add backwards compatible fallback: [puppet] - 10https://gerrit.wikimedia.org/r/965061 [09:29:51] (03PS1) 10JMeybohm: Add mw-wikifunctions records [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) [09:29:55] (03PS2) 10Jbond: late_command: add backwards compatible fallback: [puppet] - 10https://gerrit.wikimedia.org/r/965061 [09:31:15] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1001.eqiad.wmnet with OS bullseye [09:32:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/965061 (owner: 10Jbond) [09:32:50] (03CR) 10Vgutierrez: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert) [09:33:04] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/965061 (owner: 10Jbond) [09:33:13] (03CR) 10Jbond: [C: 03+2] late_command: add backwards compatible fallback: [puppet] - 10https://gerrit.wikimedia.org/r/965061 (owner: 10Jbond) [09:34:34] 10SRE-swift-storage, 10Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10MatthewVernon) I've gone looking, and the problem is that only one swift cluster has this object: ` root@ms-fe1009:/etc/swift# s... [09:34:47] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [09:34:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10MoritzMuehlenhoff) >>! In T342537#9240999, @Papaul wrote: > looking at the gerrit history about the late command i see also that there where some changes m... 
[09:37:34] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:43:38] (03PS3) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) [09:44:17] (03CR) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert) [09:47:00] (03PS1) 10Jbond: bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 [09:47:54] (03PS2) 10Jbond: bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 [09:49:11] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [09:50:24] (03CR) 10CI reject: [V: 04-1] bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 (owner: 10Jbond) [09:52:28] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [09:52:34] !log rebuilding RAID after disk replacement T348429 [09:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:37] T348429: Broken disk on ganeti1022 - https://phabricator.wikimedia.org/T348429 [09:53:38] (03PS3) 10Jbond: bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 [09:54:03] (03CR) 10CI reject: [V: 04-1] bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 (owner: 10Jbond) [09:54:13] (03PS1) 10JMeybohm: Add mw-wikifunctions discovery records [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544) [09:54:30] (03PS1) 10JMeybohm: service::catalog: Add mw-wikifunctions - 1 [puppet] - 10https://gerrit.wikimedia.org/r/965086 
(https://phabricator.wikimedia.org/T347544) [09:54:37] (03CR) 10Volans: "unrelated comments, but make sense to add them here I think" [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [09:55:35] (03PS4) 10Jbond: bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 [09:57:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 6 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43984/console" [puppet] - 10https://gerrit.wikimedia.org/r/965063 (owner: 10Jbond) [09:57:44] (03CR) 10JMeybohm: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert) [09:58:23] (03CR) 10CI reject: [V: 04-1] bird: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/965063 (owner: 10Jbond) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1000) [10:02:28] (03PS2) 10JMeybohm: Add mw-wikifunctions records [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) [10:02:30] (03PS2) 10JMeybohm: Add mw-wikifunctions discovery records [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544) [10:03:26] (03CR) 10JMeybohm: Add mw-wikifunctions records (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [10:07:14] (03PS4) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) [10:07:53] (03CR) 10Clément Goubert: wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert) [10:08:14] !log jbond@cumin1001 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [10:09:02] (03CR) 10Volans: Add mw-wikifunctions records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [10:10:28] (03PS12) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [10:10:30] (03PS42) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) [10:11:56] (03CR) 10Clément Goubert: admin_ng: Add namespace for wikifunctions mediawiki deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [10:13:47] (03CR) 10Clément Goubert: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [10:14:30] (03PS1) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 [10:14:53] (03CR) 10Clément Goubert: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [10:15:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T343198)', diff saved to https://phabricator.wikimedia.org/P52897 and previous config saved to /var/cache/conftool/dbconfig/20231011-101545-arnaudb.json [10:15:50] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [10:16:49] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43985/console" [puppet] - 10https://gerrit.wikimedia.org/r/963304 
(https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:17:15] (03CR) 10CI reject: [V: 04-1] profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [10:19:10] (03CR) 10Clément Goubert: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965086 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [10:22:21] (03PS1) 10Slyngshede: P:idm Provide callback to test system. [puppet] - 10https://gerrit.wikimedia.org/r/965093 [10:22:35] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) I've spent some time checking, and... [10:22:46] (03CR) 10CI reject: [V: 04-1] P:idm Provide callback to test system. [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede) [10:24:07] (03PS2) 10Slyngshede: P:idm Provide callback to test system. 
[puppet] - 10https://gerrit.wikimedia.org/r/965093 [10:24:59] (03CR) 10Hashar: "For Zuul in production:" [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: 10Hashar) [10:25:34] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43986/console" [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede) [10:27:03] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43987/console" [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede) [10:28:35] (03CR) 10Slyngshede: [C: 03+2] P:monitoring remove remnants of dpkg monitoring [puppet] - 10https://gerrit.wikimedia.org/r/965024 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [10:30:32] (03CR) 10Muehlenhoff: P:idm Provide callback to test system. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede) [10:30:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P52898 and previous config saved to /var/cache/conftool/dbconfig/20231011-103052-arnaudb.json [10:30:57] (03PS1) 10Muehlenhoff: Fix mariadb restart behaviour on testreduce [puppet] - 10https://gerrit.wikimedia.org/r/965095 [10:31:14] (03PS2) 10Muehlenhoff: Fix mariadb restart behaviour on testreduce [puppet] - 10https://gerrit.wikimedia.org/r/965095 [10:31:40] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy keytabs for apt1002/apt2002 [labs/private] - 10https://gerrit.wikimedia.org/r/964900 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [10:33:13] (03PS3) 10Slyngshede: P:idm Provide callback to test system. 
[puppet] - 10https://gerrit.wikimedia.org/r/965093 [10:34:17] 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10fnegri) > new error popped up after rebooting > T348550 This seems to have resolved on its own? `/usr/local... [10:34:50] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43988/console" [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede) [10:36:14] (03CR) 10Slyngshede: [V: 03+1] P:idm Provide callback to test system. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede) [10:36:16] (03CR) 10Muehlenhoff: [C: 03+2] Fix mariadb restart behaviour on testreduce [puppet] - 10https://gerrit.wikimedia.org/r/965095 (owner: 10Muehlenhoff) [10:38:19] (03CR) 10Cathal Mooney: [C: 03+2] Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - 10https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583) (owner: 10Cathal Mooney) [10:38:34] (03CR) 10Hnowlan: [C: 04-1] "+1 on Janis's comments" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T341478) (owner: 10KartikMistry) [10:38:53] (03Merged) 10jenkins-bot: Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - 10https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583) (owner: 10Cathal Mooney) [10:40:55] (03PS2) 10JMeybohm: admin_ng: Add namespace for wikifunctions mediawiki deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) [10:40:57] (03PS2) 10JMeybohm: Add mediawiki deployment for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) [10:40:59] (03PS32) 
10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [10:41:01] (03PS13) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [10:41:03] (03PS43) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) [10:41:29] (03CR) 10JMeybohm: admin_ng: Add namespace for wikifunctions mediawiki deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [10:41:32] (03PS1) 10Jbond: late_command: need to update the target apt config [puppet] - 10https://gerrit.wikimedia.org/r/965096 [10:43:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede) [10:43:36] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm Provide callback to test system. [puppet] - 10https://gerrit.wikimedia.org/r/965093 (owner: 10Slyngshede) [10:44:11] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Now I get `00411: FAILED: stashfailed: An un... 
[10:44:17] (03CR) 10Jbond: [C: 03+2] late_command: need to update the target apt config [puppet] - 10https://gerrit.wikimedia.org/r/965096 (owner: 10Jbond) [10:45:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P52899 and previous config saved to /var/cache/conftool/dbconfig/20231011-104558-arnaudb.json [10:47:34] (03PS3) 10Stevemunene: druid: Bring druid1011.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/962249 (https://phabricator.wikimedia.org/T336042) [10:47:54] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) [I'm afraid my previous comment sti... [10:48:04] (03CR) 10Daniel Kinzler: Update cxserver to 2023-10-11-045323-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T341478) (owner: 10KartikMistry) [10:48:52] (03PS1) 10FNegri: wmcs::cloudlb: add cloud_production profile [puppet] - 10https://gerrit.wikimedia.org/r/965098 [10:49:16] (03PS1) 10Samtar: InitialiseSettings-labs: Enable Edit Recovery on all beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965099 [10:50:14] jouncebot: nowandnext [10:50:14] For the next 0 hour(s) and 9 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1000) [10:50:14] In 2 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1300) [10:50:31] (03CR) 10Samtar: [C: 03+2] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965099 (owner: 10Samtar) [10:51:14] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Enable Edit Recovery on all beta [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/965099 (owner: 10Samtar) [10:52:31] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:52:41] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:53:48] (03PS33) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [10:53:50] (03PS14) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [10:53:52] (03PS44) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) [10:54:33] (03PS1) 10Kosta Harlan: labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) [10:55:47] (03PS1) 10Muehlenhoff: Assign apt_repo role to apt1002 [puppet] - 10https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613) [10:57:17] (03PS4) 10Cathal Mooney: YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) [10:58:02] (03CR) 10Cathal Mooney: [C: 03+2] YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [10:58:37] (03Merged) 10jenkins-bot: YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [10:59:12] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8 DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43990/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:59:45] (03CR) 10Dreamy Jazz: [C: 03+1] labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) (owner: 10Kosta Harlan) [11:01:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T343198)', diff saved to https://phabricator.wikimedia.org/P52900 and previous config saved to /var/cache/conftool/dbconfig/20231011-110105-arnaudb.json [11:01:07] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [11:01:16] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:01:21] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [11:01:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52901 and previous config saved to /var/cache/conftool/dbconfig/20231011-110127-arnaudb.json [11:03:33] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:08:04] (03CR) 10Btullis: Support configuring the spark3 defaults with the default shuffler (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:10:47] (03PS1) 10Hashar: zuul: move Gerrit key from merger to server [puppet] - 10https://gerrit.wikimedia.org/r/965103 [11:12:46] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:12:54] (03CR) 
10Stevemunene: [C: 03+2] druid: Bring druid1011.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/962249 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [11:12:56] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:13:16] (03CR) 10CI reject: [V: 04-1] zuul: move Gerrit key from merger to server [puppet] - 10https://gerrit.wikimedia.org/r/965103 (owner: 10Hashar) [11:14:07] 10SRE, 10Infrastructure-Foundations, 10netops: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (10cmooney) 05Open→03Resolved Changes pushed to production, closing task. [11:18:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [11:18:30] (03CR) 10Jbond: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [11:20:25] (03PS34) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [11:20:27] (03PS15) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [11:20:29] (03PS45) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) [11:21:34] (03PS4) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) [11:22:38] (03PS1) 10Jbond: late_command: set certificate_revocation = leaf in puppet [puppet] - 10https://gerrit.wikimedia.org/r/965104 (https://phabricator.wikimedia.org/T340543) 
[11:23:07] (03CR) 10CI reject: [V: 04-1] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:23:21] (03CR) 10CI reject: [V: 04-1] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney) [11:24:03] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:26:39] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Volans) 05Resolved→03Open FYI The service IPs in Netbox are still allocated to the service and probably needs cleanup: https://netbox.wikimedia.org/ipam... [11:27:09] (03PS1) 10Slyngshede: SUL account linking, display success message. [software/bitu] - 10https://gerrit.wikimedia.org/r/965105 [11:27:56] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on druid1011.eqiad.wmnet with reason: Downtime as we setup the host to join the druid and zookeper cluster [11:28:11] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on druid1011.eqiad.wmnet with reason: Downtime as we setup the host to join the druid and zookeper cluster [11:29:34] (03PS1) 10Hashar: zuul: get ssh key from Puppet collected resource [puppet] - 10https://gerrit.wikimedia.org/r/965106 [11:31:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:33] (03PS1) 10Slyngshede: P:idm use service_fqdn to link correctly. 
[puppet] - 10https://gerrit.wikimedia.org/r/965107 [11:32:07] (03CR) 10CI reject: [V: 04-1] zuul: get ssh key from Puppet collected resource [puppet] - 10https://gerrit.wikimedia.org/r/965106 (owner: 10Hashar) [11:36:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:39:02] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1029 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:47:08] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:51] * kart_ quickly updating cxserver (without RESTBase changes) [11:52:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:53:30] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43998/console" [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:53:34] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] SUL account linking, display success message. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/965105 (owner: 10Slyngshede) [11:54:10] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:54:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/965107 (owner: 10Slyngshede) [11:54:14] (03CR) 10CI reject: [V: 04-1] puppet: add support for puppetserver returning none 0 rc [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [11:54:35] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:54:36] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm use service_fqdn to link correctly. [puppet] - 10https://gerrit.wikimedia.org/r/965107 (owner: 10Slyngshede) [11:57:20] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:58:07] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:58:16] (03CR) 10Jbond: [C: 03+1] "LGTM, q inline" [puppet] - 10https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [11:59:36] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:00:03] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:00:40] !log Updated cxserver to 2023-10-11-114410-production (T341478, T347939) [12:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:52] T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939 [12:00:53] T341478: Port the markup transfer feature of cxserver to MinT - https://phabricator.wikimedia.org/T341478 [12:03:10] (03PS1) 10Slyngshede: Add missing request parameter to message [software/bitu] - 10https://gerrit.wikimedia.org/r/965115 [12:03:47] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Add missing request parameter to message [software/bitu] - 10https://gerrit.wikimedia.org/r/965115 (owner: 10Slyngshede) [12:06:05] (03CR) 10Filippo 
Giunchedi: [C: 03+1] "CI failure is unrelated (I think bitly returning 401, works locally)" [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney) [12:07:16] (03PS1) 10Clément Goubert: aux-k8s-ctrl: Fix missing PTR record [dns] - 10https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) [12:08:11] (03CR) 10Muehlenhoff: Assign apt_repo role to apt1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [12:09:05] (03PS1) 10Volans: svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119 [12:09:30] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1029 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:12:58] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [12:15:20] !log elukey@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ORES svc records - elukey@cumin1001" [12:16:03] (03CR) 10Volans: [C: 04-1] "Typo in the PTR" [dns] - 10https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) (owner: 10Clément Goubert) [12:16:04] (03CR) 10Hashar: ci: add Gerrit ssh key to ssh_known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: 10Hashar) [12:16:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ORES svc records - elukey@cumin1001" [12:16:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:16:29] (03PS2) 10JMeybohm: service::catalog: Add mw-wikifunctions - 1 [puppet] - 10https://gerrit.wikimedia.org/r/965086 
(https://phabricator.wikimedia.org/T347544) [12:16:31] (03PS1) 10JMeybohm: Add mw-wikifunctions to mediawiki k8s releases [puppet] - 10https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544) [12:16:37] (03PS1) 10Jbond: gerrit: make gerrit ssh key more DRY [puppet] - 10https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) [12:17:00] (03CR) 10Hashar: zuul: get ssh key from Puppet collected resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965106 (owner: 10Hashar) [12:17:24] (03CR) 10CI reject: [V: 04-1] gerrit: make gerrit ssh key more DRY [puppet] - 10https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: 10Jbond) [12:18:58] (03PS2) 10Clément Goubert: aux-k8s-ctrl: Fix missing PTR record [dns] - 10https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) [12:19:19] (03CR) 10Muehlenhoff: Fix wording (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/965120 (owner: 10Slyngshede) [12:19:27] (03CR) 10Clément Goubert: aux-k8s-ctrl: Fix missing PTR record (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) (owner: 10Clément Goubert) [12:19:32] (03CR) 10Jbond: "see comments in line, my change is failing pcc so there are still some bits to fix up but it shouldn't be too difficult to fix up" [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: 10Hashar) [12:20:56] (03PS2) 10Slyngshede: Update wording to read more clearly. [software/bitu] - 10https://gerrit.wikimedia.org/r/965120 [12:21:16] (03CR) 10Slyngshede: Update wording to read more clearly. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/965120 (owner: 10Slyngshede) [12:21:50] 10SRE, 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10Volans) I really think that we need to find a solution for this. It has been pending for too long. Today I did a check of the dns repository a... [12:23:14] (03PS2) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 [12:23:23] (03CR) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [12:23:34] (03CR) 10Clément Goubert: [C: 03+1] Add mw-wikifunctions to mediawiki k8s releases [puppet] - 10https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [12:23:39] (03CR) 10CI reject: [V: 04-1] profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [12:23:49] (03PS2) 10Jbond: gerrit: make gerrit ssh key more DRY [puppet] - 10https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) [12:24:14] (03CR) 10CI reject: [V: 04-1] gerrit: make gerrit ssh key more DRY [puppet] - 10https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: 10Jbond) [12:24:26] (03PS1) 10Filippo Giunchedi: test: dump response body on runbook fetch failure [alerts] - 10https://gerrit.wikimedia.org/r/965123 [12:25:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/965120 (owner: 10Slyngshede) [12:25:53] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Update wording to read more clearly. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/965120 (owner: 10Slyngshede) [12:27:16] (03PS1) 10Elukey: role::redis::misc::{master,slave}: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) [12:28:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "Apparently no more 401, at least not right now, merging anyways" [alerts] - 10https://gerrit.wikimedia.org/r/965123 (owner: 10Filippo Giunchedi) [12:28:21] (03CR) 10Filippo Giunchedi: [C: 03+2] test: dump response body on runbook fetch failure [alerts] - 10https://gerrit.wikimedia.org/r/965123 (owner: 10Filippo Giunchedi) [12:30:04] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44001/console" [puppet] - 10https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [12:31:30] (03PS2) 10Elukey: role::redis::misc::{master,slave}: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) [12:32:16] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [dns] - 10https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) (owner: 10Clément Goubert) [12:33:14] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [12:34:34] (03PS2) 10Hashar: ci: add Gerrit ssh key to ssh_known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) [12:34:52] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:34:53] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [12:37:12] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Cleanup decommissioned services apple-search and graphoid - cgoubert@cumin1001" [12:38:03] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Cleanup decommissioned services apple-search and graphoid - cgoubert@cumin1001" [12:38:03] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:38:12] (03CR) 10Clément Goubert: [C: 03+2] aux-k8s-ctrl: Fix missing PTR record [dns] - 10https://gerrit.wikimedia.org/r/965117 (https://phabricator.wikimedia.org/T348632) (owner: 10Clément Goubert) [12:39:56] (03CR) 10Klausman: [C: 03+1] role::redis::misc::{master,slave}: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [12:41:01] (03CR) 10Hashar: ci: add Gerrit ssh key to ssh_known_hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: 10Hashar) [12:42:31] 10SRE, 10Discovery-Search, 10collaboration-services, 10serviceops, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) 05Open→03Resolved Done [12:42:50] (03CR) 10JMeybohm: [C: 03+1] kube-state-metrics: create image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: 10Kamila Součková) [12:43:22] (03PS3) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 [12:43:36] (03PS1) 10AikoChou: ml-services: update revertrisk-la docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965146 (https://phabricator.wikimedia.org/T347550) [12:44:08] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert Done [12:44:25] (03PS1) 10Clément Goubert: tegola-vector-tiles: Fix missing PTR [dns] - 10https://gerrit.wikimedia.org/r/965147 (https://phabricator.wikimedia.org/T348631) 
[12:44:45] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Add namespace for wikifunctions mediawiki deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [12:45:57] (03CR) 10CI reject: [V: 04-1] profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [12:47:03] (03Merged) 10jenkins-bot: admin_ng: Add namespace for wikifunctions mediawiki deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/965054 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [12:48:35] (03PS3) 10JMeybohm: Add mw-wikifunctions records [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) [12:51:21] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:51:55] (03PS3) 10Jbond: gerrit: make gerrit ssh key more DRY [puppet] - 10https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) [12:52:00] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:52:08] !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:52:11] (03CR) 10Jbond: "FYI i updated the following to include this" [puppet] - 10https://gerrit.wikimedia.org/r/965103 (owner: 10Hashar) [12:52:20] (03CR) 10CI reject: [V: 04-1] gerrit: make gerrit ssh key more DRY [puppet] - 10https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: 10Jbond) [12:53:45] !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:53:47] (03PS1) 10Cathal Mooney: Add puppet elements for newly added switches. [puppet] - 10https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230) [12:53:55] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[12:53:58] (03CR) 10Jbond: "this approach is fine but would still leave a bit of duplication in the labs profile" [puppet] - 10https://gerrit.wikimedia.org/r/965106 (owner: 10Hashar) [12:54:19] (03CR) 10CI reject: [V: 04-1] Add puppet elements for newly added switches. [puppet] - 10https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [12:55:14] (03CR) 10Jbond: ci: add Gerrit ssh key to ssh_known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: 10Hashar) [12:55:33] (03PS2) 10Cathal Mooney: Add puppet elements for newly added switches. [puppet] - 10https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230) [12:55:39] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:56:05] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:56:46] (03PS1) 10Slyngshede: Minor styling updates [software/bitu] - 10https://gerrit.wikimedia.org/r/965150 [12:56:54] (03CR) 10Ssingh: [V: 03+1] hiera: announce ns1 IP from bird (codfw) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [12:56:56] 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) [12:57:56] (03CR) 10Jbond: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [12:58:27] (03CR) 10Clément Goubert: Add mediawiki deployment for wikifunctions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [12:58:33] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Minor styling updates [software/bitu] - 10https://gerrit.wikimedia.org/r/965150 (owner: 10Slyngshede) 
[12:58:52] (03PS4) 10Jbond: gerrit: make gerrit ssh key more DRY [puppet] - 10https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) [12:59:18] (03CR) 10CI reject: [V: 04-1] gerrit: make gerrit ssh key more DRY [puppet] - 10https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: 10Jbond) [12:59:20] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1300). [13:00:05] TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:54] * TheresNoTime is going to remove that [13:02:06] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[13:02:28] (03PS5) 10Jbond: gerrit: make gerrit ssh key more DRY [puppet] - 10https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) [13:02:49] (03CR) 10JMeybohm: Add mediawiki deployment for wikifunctions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [13:03:25] 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10aborrero) [13:05:34] (03PS3) 10JMeybohm: Add mediawiki deployment for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) [13:06:24] (03PS1) 10Elukey: api-gateway: add Content-type in the CORS' allowed headers settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/965153 (https://phabricator.wikimedia.org/T348511) [13:06:28] (03CR) 10JMeybohm: Add mediawiki deployment for wikifunctions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [13:06:30] (03CR) 10Clément Goubert: [C: 03+1] Add mediawiki deployment for wikifunctions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [13:07:28] (03PS4) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 [13:07:40] (03CR) 10JMeybohm: [C: 03+2] Add mediawiki deployment for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [13:07:47] (03CR) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [13:08:36] (03Merged) 10jenkins-bot: Add 
mediawiki deployment for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/965055 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [13:09:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [13:10:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [13:11:50] (03CR) 10Ayounsi: [C: 03+1] hiera: announce ns1 IP from bird (codfw) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [13:12:20] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM! Thanks for spotting this!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965153 (https://phabricator.wikimedia.org/T348511) (owner: 10Elukey) [13:13:19] (03PS1) 10Majavah: team-wmcs: ceph: cleanup summaries of existing alerts [alerts] - 10https://gerrit.wikimedia.org/r/965154 [13:13:21] (03PS1) 10Majavah: team-wmcs: ceph: add alert for slow ops [alerts] - 10https://gerrit.wikimedia.org/r/965155 [13:13:46] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [13:14:41] (03CR) 10Ayounsi: [C: 03+1] "Might be worth running PCC on alert1001 and install1004 just in case." 
[puppet] - 10https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [13:14:43] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [13:14:44] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [13:15:55] !log jbond@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [13:16:07] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [13:16:17] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [13:16:33] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [13:18:26] (03CR) 10Klausman: [C: 03+1] api-gateway: add Content-type in the CORS' allowed headers settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/965153 (https://phabricator.wikimedia.org/T348511) (owner: 10Elukey) [13:18:43] (03PS2) 10JMeybohm: Add mw-wikifunctions to mediawiki k8s releases [puppet] - 10https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544) [13:18:45] (03PS3) 10JMeybohm: service::catalog: Add mw-wikifunctions - 1 [puppet] - 10https://gerrit.wikimedia.org/r/965086 (https://phabricator.wikimedia.org/T347544) [13:18:47] (03PS1) 10JMeybohm: deployment_server: Add mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/965156 (https://phabricator.wikimedia.org/T347544) [13:20:00] (03CR) 10JMeybohm: [C: 03+2] deployment_server: Add mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/965156 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [13:23:32] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [13:24:03] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage 
for host cloudvirt1064.eqiad.wmnet with OS bullseye [13:24:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [13:24:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) @MoritzMuehlenhoff thanks [13:24:58] !log starting decommission of restbase2012-a — T328490 [13:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:01] T328490: restbase cluster: decommission end-of-life hosts - https://phabricator.wikimedia.org/T328490 [13:25:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2497 [13:25:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2497 [13:26:02] (03PS1) 10Jelto: gitlab_runner: block dockerhub on Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/965157 (https://phabricator.wikimedia.org/T320730) [13:26:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6368 [13:26:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6368 [13:26:51] TheresNoTime: is the backport window still sufficiently open that I could sneak something in, or should I wait for the next one? [13:27:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9031 [13:27:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9031 [13:27:27] Kemayo: go ahead if you can deploy :) [13:27:49] I cannot deploy, unfortunately. 
[13:27:58] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [13:28:12] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Enable logging of remote IPs. [puppet] - 10https://gerrit.wikimedia.org/r/963258 (owner: 10Slyngshede) [13:28:38] !log disable puppet on P:bird::anycast [13:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:42] !log disable puppet on P:bird::anycast: T348041 [13:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:46] T348041: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 [13:28:48] (03PS4) 10Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) [13:29:20] Kemayo: I'm away from my laptop, what did you want to get deployed? [13:30:01] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44004/console" [puppet] - 10https://gerrit.wikimedia.org/r/965157 (https://phabricator.wikimedia.org/T320730) (owner: 10Jelto) [13:30:04] (03CR) 10Muehlenhoff: [C: 03+1] "The SSH key management module works fine and is ready to go live (also tested email changes and implicitly the new theming), let's update " [software/bitu] - 10https://gerrit.wikimedia.org/r/959211 (owner: 10Slyngshede) [13:30:11] (03PS1) 10JMeybohm: Remove namespace quota and limitranger from mw-wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/965158 (https://phabricator.wikimedia.org/T347544) [13:30:19] TheresNoTime: I had a config change and a backport of a change to VE. If it can't happen right now, that's fine, I can make it to the late window. [13:30:36] Probably best, sorry! 
:) [13:30:44] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38195 [13:30:55] TheresNoTime: 👍🏻 [13:31:22] I’m around if needed [13:31:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38195 [13:31:49] jouncebot: next [13:31:49] In 0 hour(s) and 28 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1400) [13:31:49] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: announce ns1 IP from bird (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [13:31:51] hm [13:32:08] Kemayo: how closely related are the config change and backport? I’m not sure there’s time for both [13:32:37] but I could probably deploy one at least, if that’s useful [13:32:51] (03CR) 10JMeybohm: [C: 03+2] Remove namespace quota and limitranger from mw-wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/965158 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [13:34:09] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [13:34:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 40317 [13:35:23] (03Merged) 10jenkins-bot: Remove namespace quota and limitranger from mw-wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/965158 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [13:35:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40317 [13:36:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38628 [13:36:12] !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[13:36:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38628 [13:37:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 150552 [13:37:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 150552 [13:37:32] Lucas_WMDE: sadly they both need to go in. The backport could go without the config because it won’t actually be active without it, I guess… [13:37:34] !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:37:53] eh, I could still do the backport then [13:37:57] so the late window goes faster ^^ [13:37:58] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:38:00] wdyt? [13:38:08] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) 05Resolved→03Open [13:38:29] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) @Papaul can I loop you in here, please? You've previously managed to successfully configure hardware like this as JBOD, but it seems to... [13:38:30] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:38:32] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:39:10] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10MatthewVernon) 05Resolved→03Open [re-opening this as the JBOD issue still needs resolving, similar to T342674] [13:39:43] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:39:44] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[13:40:17] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:40:45] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [13:41:04] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10MatthewVernon) a:05MatthewVernon→03None [13:41:34] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10MatthewVernon) [13:41:43] (03PS1) 10Muehlenhoff: Failover testreduce to testreduce1002 [dns] - 10https://gerrit.wikimedia.org/r/965163 (https://phabricator.wikimedia.org/T345220) [13:42:04] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [dns] - 10https://gerrit.wikimedia.org/r/965147 (https://phabricator.wikimedia.org/T348631) (owner: 10Clément Goubert) [13:42:15] (03PS2) 10Muehlenhoff: Failover testreduce to testreduce1002 [dns] - 10https://gerrit.wikimedia.org/r/965163 (https://phabricator.wikimedia.org/T345220) [13:42:40] Lucas_WMDE: sure, that works! They’re in the late backport window on Deployments now, if you want to grab the commands. [13:42:43] !log restart kube-apiserver on ml-serve-ctrl1001 as attempt to clear a weird golang/protobuf issue while retrieving secrets [13:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:46] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10MatthewVernon) moss-be1003 is now in place (cf T342675) so could the NVME card be installed please @Jclark-ctr ? [13:43:03] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? 
- https://phabricator.wikimedia.org/T310923 (10MatthewVernon) a:05MatthewVernon→03None [13:43:13] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:43:18] (03PS1) 10Slyngshede: P:idm improve apache2 logging. [puppet] - 10https://gerrit.wikimedia.org/r/965165 [13:43:45] Kemayo: okay! can the backport alone be tested? [13:43:46] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, and 2 others: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10Jdforrester-WMF) [13:44:03] Lucas_WMDE: this one, to be specific: https://gerrit.wikimedia.org/r/c/963042/ [13:44:10] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10MatthewVernon) moss-be2003 is now on site (cf T342674) so could this NVME card now be installed, please, @Jhancock.wm ? [13:44:12] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Enable SSH key management for all users. [software/bitu] - 10https://gerrit.wikimedia.org/r/959211 (owner: 10Slyngshede) [13:44:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963042 (owner: 10DLynch) [13:44:39] (just wondering whether I should wait for your confirmation when it’s on the test servers, or sync it directly) [13:45:06] also I’m guessing the wmf.28 backport is obsolete now :) [13:45:25] !log restart kube-apiserver on ml-serve-ctrl1002 [13:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:54] Yeah, this was all originally [13:46:04] Going to be deployed last week, and so… [13:46:32] ok, I see [13:46:46] But yes. Sync it directly — there’s no way for me to actually test it without the config patch also being out. [13:46:55] ack, thanks! 
[13:46:59] I’ll do that then [13:47:04] and good luck tonight ^^ [13:47:35] Thanks for the help! [13:48:31] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10fnegri) If it can be useful, I generated a summary of `Offline_Uncorrectable` sectors per host: https://phabricator.wikimedia.org/P52907 [13:49:12] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/964506 (owner: 10L10n-bot) [13:49:14] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44005/console" [puppet] - 10https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [13:50:52] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [13:53:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Jhancock.wm) @MatthewVernon the card is installed. [13:54:49] (03PS1) 10Ayounsi: set anycast4 orlonger instead of longer [homer/public] - 10https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041) [13:55:16] 10SRE, 10ops-eqiad: Broken disk on ganeti1022 - https://phabricator.wikimedia.org/T348429 (10Jclark-ctr) 05Open→03Resolved [13:55:56] (03CR) 10Muehlenhoff: [C: 03+2] Assign apt_repo role to apt1002 [puppet] - 10https://gerrit.wikimedia.org/r/965101 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [13:56:15] (03CR) 10Ssingh: [C: 03+1] "FWIW :)" [homer/public] - 10https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041) (owner: 10Ayounsi) [13:56:18] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041) (owner: 10Ayounsi) [13:56:30] (03CR) 10Ayounsi: [C: 03+2] set anycast4 orlonger instead of longer [homer/public] - 10https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041) (owner: 10Ayounsi) [13:56:53] 10SRE, 10observability, 10SRE Observability (FY2023/2024-Q2): Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10herron) 05Open→03Resolved a:03herron Done! [13:57:07] (03Merged) 10jenkins-bot: set anycast4 orlonger instead of longer [homer/public] - 10https://gerrit.wikimedia.org/r/965169 (https://phabricator.wikimedia.org/T348041) (owner: 10Ayounsi) [13:58:17] (03Merged) 10jenkins-bot: Edit check: Simplify "experience" config to "maximumEditcount" [extensions/VisualEditor] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963042 (owner: 10DLynch) [13:58:33] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye [13:58:51] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [13:58:57] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10MatthewVernon) [13:58:58] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963042|Edit check: Simplify "experience" config to "maximumEditcount"]] [13:59:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [13:59:11] * Lucas_WMDE acks TheresNoTime’s beta-only change on deploy2002 [13:59:12] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? 
- https://phabricator.wikimedia.org/T310923 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Oh, yes, so it is, sorry. [14:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1400) [14:00:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kemayo: Backport for [[gerrit:963042|Edit check: Simplify "experience" config to "maximumEditcount"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:00:22] I’m still deploying, sorry [14:00:26] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kemayo: Continuing with sync [14:00:36] maybe ~4 more minutes or so [14:01:25] No worries from me [14:01:51] Kemayo: that was mainly directed at the people doing the Wikifunction Services window [14:02:08] (if the window had any IRC nicks in it I could ping them to let them know they’re not yet free to go…) [14:02:23] 🤔 [14:03:17] 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T348550 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:05:29] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [14:06:02] (03CR) 10Klausman: [C: 03+1] ml-services: update revertrisk-la docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965146 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) [14:06:11] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963042|Edit check: Simplify "experience" config to "maximumEditcount"]] (duration: 07m 13s) [14:06:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:06:17] * 
Lucas_WMDE done [14:06:24] if anyone wants to deploy wikifunctions services now :) [14:06:35] (03CR) 10JMeybohm: Pull some flink config down into the chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [14:07:14] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [14:07:20] I would like to do something different that would interfere with mw deployments. So I'll thankfully take the heads-up :) [14:09:00] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Add mw-wikifunctions to mediawiki k8s releases [puppet] - 10https://gerrit.wikimedia.org/r/965121 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [14:09:48] (03CR) 10Kamila Součková: "Thanks for the review!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: 10Kamila Součková) [14:10:28] (03CR) 10Elukey: [C: 03+1] kube-state-metrics: create image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: 10Kamila Součková) [14:10:49] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) I opened up a ticket with Dell for 1 server right now. Confirmed: Service Request 177592506 was successfully submitted. 
[14:13:07] !log jayme@deploy2002 Started scap: (no justification provided) [14:15:22] !log jayme@deploy2002 Finished scap: (no justification provided) (duration: 02m 15s) [14:16:49] (03PS1) 10Muehlenhoff: Extend acmechief config for new apt hosts [puppet] - 10https://gerrit.wikimedia.org/r/965170 (https://phabricator.wikimedia.org/T331613) [14:17:01] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [14:18:05] (03CR) 10JMeybohm: [C: 03+2] Add mw-wikifunctions records [dns] - 10https://gerrit.wikimedia.org/r/965062 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [14:18:15] !log installing curl security updates on bullseye/bookworm [14:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:05] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [14:21:07] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:21:13] (03Abandoned) 10DLynch: Edit check: Simplify "experience" config to "maximumEditcount" [extensions/VisualEditor] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963041 (owner: 10DLynch) [14:21:17] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:22:37] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1101'] [14:23:05] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, and 2 others: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10JMeybohm) [14:23:48] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:24:07] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:24:31] (03PS2) 10Clément Goubert: tegola-vector-tiles: Fix missing PTR [dns] - 10https://gerrit.wikimedia.org/r/965147 
(https://phabricator.wikimedia.org/T348631) [14:25:06] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:25:16] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:25:17] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:25:30] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:26:11] (03PS1) 10Hnowlan: rest-gateway: correct paths for edit, editor and page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/965173 (https://phabricator.wikimedia.org/T347027) [14:27:43] (03PS1) 10Muehlenhoff: Move restbase canary [puppet] - 10https://gerrit.wikimedia.org/r/965174 (https://phabricator.wikimedia.org/T328490) [14:27:55] (03CR) 10Kamila Součková: [V: 03+2 C: 03+2] kube-state-metrics: create image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) (owner: 10Kamila Součková) [14:28:13] (03CR) 10Clément Goubert: [C: 03+2] tegola-vector-tiles: Fix missing PTR [dns] - 10https://gerrit.wikimedia.org/r/965147 (https://phabricator.wikimedia.org/T348631) (owner: 10Clément Goubert) [14:28:45] !log Running authdns-update - T348631 [14:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:49] T348631: tegola-vector-tiles SVC records missing reverse PTRs - https://phabricator.wikimedia.org/T348631 [14:29:56] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: correct paths for edit, editor and page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/965173 (https://phabricator.wikimedia.org/T347027) (owner: 10Hnowlan) [14:30:27] PROBLEM - Check systemd state on apt1002 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:49] (03Merged) 10jenkins-bot: rest-gateway: correct paths for edit, editor and page-analytics 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/965173 (https://phabricator.wikimedia.org/T347027) (owner: 10Hnowlan) [14:31:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:31:53] (03CR) 10JMeybohm: [C: 03+2] service::catalog: Add mw-wikifunctions - 1 [puppet] - 10https://gerrit.wikimedia.org/r/965086 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [14:33:45] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:16] (03CR) 10AikoChou: [C: 03+2] ml-services: update revertrisk-la docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965146 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) [14:35:09] (03Merged) 10jenkins-bot: ml-services: update revertrisk-la docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965146 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) [14:36:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:37:05] (03PS1) 10JMeybohm: service::catalog: Add mw-wikifunctions - 2 [puppet] - 10https://gerrit.wikimedia.org/r/965175 (https://phabricator.wikimedia.org/T347544) [14:37:07] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? 
- https://phabricator.wikimedia.org/T310922 (10Jclark-ctr) @MatthewVernon Installed last nvme card into moss-be1003 [14:37:09] (03PS1) 10JMeybohm: service::catalog: Add mw-wikifunctions - 3 [puppet] - 10https://gerrit.wikimedia.org/r/965176 (https://phabricator.wikimedia.org/T347544) [14:37:11] (03PS1) 10JMeybohm: service::catalog: Add mw-wikifunctions - 4 [puppet] - 10https://gerrit.wikimedia.org/r/965177 (https://phabricator.wikimedia.org/T347544) [14:37:24] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10Jclark-ctr) [14:37:49] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:38:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:17] (03PS3) 10Volans: svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119 [14:39:34] (03CR) 10Vgutierrez: [C: 03+1] service::catalog: Add mw-wikifunctions - 2 [puppet] - 10https://gerrit.wikimedia.org/r/965175 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [14:40:03] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:49] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [14:42:59] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (10Clement_Goubert) 05Open→03In progress 
[14:43:25] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:18] (03PS1) 10Elukey: profile::prometheus::k8s: drop unused labels for k8s-pods-kserve [puppet] - 10https://gerrit.wikimedia.org/r/965178 (https://phabricator.wikimedia.org/T348456) [14:45:26] !log disabling puppet on 'P{O:lvs::balancer} and (A:codfw or A:eqiad)' [14:45:27] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [14:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:43] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [14:46:13] (03CR) 10JMeybohm: [C: 03+2] service::catalog: Add mw-wikifunctions - 2 [puppet] - 10https://gerrit.wikimedia.org/r/965175 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [14:47:54] (03PS5) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 [14:48:03] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, and 2 others: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10Jclark-ctr) @taavi What vlan are these going to be I would like to verify with @cmooney that these can go into these racks before i physically move them. 
[14:48:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:29] !log running puppet on 'O:lvs::balancer' [14:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:39] (03CR) 10Eevans: [C: 03+1] Move restbase canary [puppet] - 10https://gerrit.wikimedia.org/r/965174 (https://phabricator.wikimedia.org/T328490) (owner: 10Muehlenhoff) [14:50:43] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: block dockerhub on Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/965157 (https://phabricator.wikimedia.org/T320730) (owner: 10Jelto) [14:52:07] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:10] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44006/console" [puppet] - 10https://gerrit.wikimedia.org/r/965178 (https://phabricator.wikimedia.org/T348456) (owner: 10Elukey) [14:52:15] !log restarting pybal on lvs1020 and lvs2014 [14:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:52] (03CR) 10Muehlenhoff: [C: 03+2] Move restbase canary [puppet] - 10https://gerrit.wikimedia.org/r/965174 (https://phabricator.wikimedia.org/T328490) (owner: 10Muehlenhoff) [14:54:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [14:54:39] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:54:53] this is me [14:55:19] !log restarting pybal on lvs1019 and lvs2013 [14:55:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:27] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 78 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [14:55:51] that's jayme :) [14:56:28] that's right :) [14:57:29] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:46] that's kind of me as well [14:59:20] (03PS1) 10Ilias Sarantopoulos: admin_ng/ml-serve: add namespace permissions for llm [puppet] - 10https://gerrit.wikimedia.org/r/965180 (https://phabricator.wikimedia.org/T348661) [14:59:52] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:59:56] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, and 2 others: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10cmooney) Thanks @Jclark-ctr, yes, these can go in E4 or F4, no problem. 
[15:00:00] PROBLEM - HTTP on apt1002 is CRITICAL: connect to address 208.80.154.10 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/APT_repository [15:00:20] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:26] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [15:00:28] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/964946 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [15:01:33] (03CR) 10JMeybohm: [C: 03+2] service::catalog: Add mw-wikifunctions - 3 [puppet] - 10https://gerrit.wikimedia.org/r/965176 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [15:01:42] PROBLEM - HTTPS on apt1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/APT_repository [15:03:03] (03PS1) 10Ilias Sarantopoulos: admin_ng: add llm namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/965181 (https://phabricator.wikimedia.org/T348661) [15:04:59] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on apt1002.wikimedia.org with reason: setup in progress [15:05:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on apt1002.wikimedia.org with reason: setup in progress [15:05:56] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route pageviews to page-analytics [puppet] - 10https://gerrit.wikimedia.org/r/964946 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [15:07:21] (03PS1) 10Muehlenhoff: Always restart parsoid-rt/parsoid-rt-client on failures [puppet] - 10https://gerrit.wikimedia.org/r/965183 [15:09:49] (03CR) 10Klausman: [C: 03+1] 
admin_ng/ml-serve: add namespace permissions for llm [puppet] - 10https://gerrit.wikimedia.org/r/965180 (https://phabricator.wikimedia.org/T348661) (owner: 10Ilias Sarantopoulos) [15:09:54] (03CR) 10Klausman: [C: 03+1] admin_ng: add llm namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/965181 (https://phabricator.wikimedia.org/T348661) (owner: 10Ilias Sarantopoulos) [15:10:29] (03CR) 10Klausman: [C: 03+1] profile::prometheus::k8s: drop unused labels for k8s-pods-kserve [puppet] - 10https://gerrit.wikimedia.org/r/965178 (https://phabricator.wikimedia.org/T348456) (owner: 10Elukey) [15:12:31] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] aborrero: drop access [labs/private] - 10https://gerrit.wikimedia.org/r/964926 (owner: 10Arturo Borrero Gonzalez) [15:12:33] (03PS1) 10Hnowlan: trafficserver: correct pageviews paths [puppet] - 10https://gerrit.wikimedia.org/r/965184 (https://phabricator.wikimedia.org/T336391) [15:12:42] (03CR) 10JMeybohm: [C: 03+2] service::catalog: Add mw-wikifunctions - 4 [puppet] - 10https://gerrit.wikimedia.org/r/965177 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [15:14:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:15:39] (03PS1) 10Ssingh: hiera: announce ns0 IP from bird (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/965187 (https://phabricator.wikimedia.org/T348041) [15:16:17] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10MatthewVernon) [15:16:52] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44007/console" [puppet] - 10https://gerrit.wikimedia.org/r/965187 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [15:17:14] (03CR) 10Klausman: [C: 03+2] admin_ng/ml-serve: add namespace permissions for llm [puppet] - 10https://gerrit.wikimedia.org/r/965180 (https://phabricator.wikimedia.org/T348661) (owner: 10Ilias Sarantopoulos) [15:17:49] (03CR) 10Klausman: [C: 03+2] admin_ng: add llm namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/965181 (https://phabricator.wikimedia.org/T348661) (owner: 10Ilias Sarantopoulos) [15:18:09] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5001.eqsin.wmnet with OS bookworm [15:18:18] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm [15:18:59] (03CR) 10Ssingh: [V: 03+1] "To be merged tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/965187 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [15:20:13] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:20:43] (03PS1) 10Ilias Sarantopoulos: ml-services: add langid in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) [15:20:45] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:21:08] (03CR) 10Fabfur: [C: 03+1] trafficserver: correct pageviews paths [puppet] - 10https://gerrit.wikimedia.org/r/965184 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [15:21:36] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[15:21:41] (03CR) 10Hnowlan: [C: 03+2] trafficserver: correct pageviews paths [puppet] - 10https://gerrit.wikimedia.org/r/965184 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [15:22:09] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:22:12] (03CR) 10Subramanya Sastry: [C: 03+1] Always restart parsoid-rt/parsoid-rt-client on failures [puppet] - 10https://gerrit.wikimedia.org/r/965183 (owner: 10Muehlenhoff) [15:22:51] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:23:02] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:23:09] (03PS3) 10JMeybohm: Add mw-wikifunctions discovery records [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544) [15:23:11] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:24:35] (03CR) 10Klausman: [C: 03+1] ml-services: add langid in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [15:24:58] (03CR) 10JMeybohm: [C: 03+2] Add mw-wikifunctions discovery records [dns] - 10https://gerrit.wikimedia.org/r/965065 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [15:25:00] (03PS6) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 [15:25:04] !log depool ncredir5001 [15:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:40] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye [15:26:22] (03CR) 10Muehlenhoff: "The earlier version has a logic error; having a ferm::service without $ferm_srange is actually supported and results in a firewall def wit" [puppet] - 
10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [15:26:40] (03CR) 10Muehlenhoff: [C: 03+2] Always restart parsoid-rt/parsoid-rt-client on failures [puppet] - 10https://gerrit.wikimedia.org/r/965183 (owner: 10Muehlenhoff) [15:27:08] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:27:47] (03PS1) 10Ilias Sarantopoulos: APIGW: add entry for llm langid LW isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/965191 (https://phabricator.wikimedia.org/T340507) [15:27:58] 10SRE, 10ops-eqiad, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{1001..1009}.eqiad.wmnet - https://phabricator.wikimedia.org/T348144 (10Jclark-ctr) 05Open→03Resolved [15:28:23] 10SRE, 10ops-eqiad, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{1001..1009}.eqiad.wmnet - https://phabricator.wikimedia.org/T348144 (10Jclark-ctr) a:03Jclark-ctr [15:30:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [15:32:16] (03CR) 10David Caro: [C: 03+1] "Thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/965155 (owner: 10Majavah) [15:34:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:36:03] (03CR) 10Elukey: ml-services: add langid in llm namespace (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [15:38:42] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, and 2 others: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10JMeybohm) [15:45:07] (03PS1) 10Jclark-ctr: add stat1011 to autoinstall and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/965193 (https://phabricator.wikimedia.org/T342454) [15:45:59] (03CR) 10Jclark-ctr: [C: 03+2] add stat1011 to autoinstall and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/965193 (https://phabricator.wikimedia.org/T342454) (owner: 10Jclark-ctr) [15:52:29] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye [15:52:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1011.eqiad.wmnet with OS bullseye [15:52:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye [15:52:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye executed with errors:... 
[15:53:36] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache mw-wikifunctions.discovery.wmnet on eqiad recursors [15:53:36] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw-wikifunctions.discovery.wmnet on eqiad recursors [15:53:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice! Thank you for being mindful of extra labels/metrics" [puppet] - 10https://gerrit.wikimedia.org/r/965178 (https://phabricator.wikimedia.org/T348456) (owner: 10Elukey) [15:53:46] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache mw-wikifunctions.discovery.wmnet on codfw recursors [15:53:47] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw-wikifunctions.discovery.wmnet on codfw recursors [15:54:39] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5001.eqsin.wmnet with reason: host reimage [15:55:20] (03CR) 10David Caro: [C: 03+1] "👍" [alerts] - 10https://gerrit.wikimedia.org/r/965154 (owner: 10Majavah) [15:55:31] (03CR) 10Majavah: [C: 03+2] team-wmcs: ceph: cleanup summaries of existing alerts [alerts] - 10https://gerrit.wikimedia.org/r/965154 (owner: 10Majavah) [15:55:34] (03CR) 10Majavah: [C: 03+2] team-wmcs: ceph: add alert for slow ops [alerts] - 10https://gerrit.wikimedia.org/r/965155 (owner: 10Majavah) [15:56:45] (03CR) 10CI reject: [V: 04-1] team-wmcs: ceph: cleanup summaries of existing alerts [alerts] - 10https://gerrit.wikimedia.org/r/965154 (owner: 10Majavah) [15:56:47] (03CR) 10CI reject: [V: 04-1] team-wmcs: ceph: add alert for slow ops [alerts] - 10https://gerrit.wikimedia.org/r/965155 (owner: 10Majavah) [15:57:26] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5001.eqsin.wmnet with reason: host reimage [16:04:11] PROBLEM - 
mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:08:08] (03CR) 10Ilias Sarantopoulos: ml-services: add langid in llm namespace (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [16:10:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:23:03] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:13] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir5001.eqsin.wmnet with OS bookworm [16:29:23] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm completed: - ncredir5001 (**PASS**) - Removed from Pup... 
[16:31:57] (03PS1) 10Majavah: Don't double-escape link contents [extensions/GlobalBlocking] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965207 (https://phabricator.wikimedia.org/T348669) [16:33:03] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:10] (03CR) 10Jforrester: [C: 03+1] Don't double-escape link contents [extensions/GlobalBlocking] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965207 (https://phabricator.wikimedia.org/T348669) (owner: 10Majavah) [16:33:19] jouncebot: nowandnext [16:33:19] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [16:33:19] In 0 hour(s) and 26 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1700) [16:33:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965207 (https://phabricator.wikimedia.org/T348669) (owner: 10Majavah) [16:35:48] (03Merged) 10jenkins-bot: Don't double-escape link contents [extensions/GlobalBlocking] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965207 (https://phabricator.wikimedia.org/T348669) (owner: 10Majavah) [16:36:17] !log taavi@deploy2002 Started scap: Backport for [[gerrit:965207|Don't double-escape link contents (T348669)]] [16:36:21] T348669: GlobalBlocking navigation bar is double-escaped - https://phabricator.wikimedia.org/T348669 [16:37:23] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:25] 10SRE, 10Infrastructure-Foundations, 10Puppet CI: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (10thcipriani) Home directory was cleaned up. 
Removing our team tag since immediate problem was isolated, and SRE maintain the puppet-diff project. Ping if I missed anything! [16:37:41] !log taavi@deploy2002 taavi: Backport for [[gerrit:965207|Don't double-escape link contents (T348669)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:38:07] !log taavi@deploy2002 taavi: Continuing with sync [16:39:30] (03PS1) 10DLynch: Remove override to allow mobile edit notices to display on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965205 (https://phabricator.wikimedia.org/T316178) [16:40:55] 10SRE, 10Infrastructure-Foundations, 10Puppet CI: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (10ssingh) 05Open→03Resolved a:03ssingh Marking this as resolved as the person who initially reported this. Thanks for the help everyone! [16:41:42] 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10M2k_dewiki) [16:42:44] (03PS3) 10Jforrester: wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - 10https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388) [16:42:48] (03PS3) 10Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) [16:42:49] (03PS3) 10Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) [16:43:53] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:965207|Don't double-escape link contents (T348669)]] (duration: 07m 35s) [16:43:56] * taavi done [16:44:11] T348669: GlobalBlocking navigation bar is double-escaped - https://phabricator.wikimedia.org/T348669 [16:44:37] !log jclark@cumin1001 START - Cookbook 
sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye [16:44:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye [16:44:45] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1011.eqiad.wmnet with OS bullseye [16:44:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye executed wit... [16:46:07] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:56] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1011'] [16:47:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['stat1011'] [16:48:51] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye [16:48:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1011.eqiad.wmnet with OS bullseye [16:48:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye [16:49:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye executed wit... [16:50:16] (03PS4) 10Jforrester: wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - 10https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388) [16:50:18] (03PS4) 10Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) [16:50:20] (03PS4) 10Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) [16:50:22] (03PS1) 10Jforrester: wikifunctions: Simplify releases/environments config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965226 [16:51:58] (03PS2) 10Ryan Kemper: rdf-streaming-updater: restrict space usage alert from 1TiB to 50GiB [alerts] - 10https://gerrit.wikimedia.org/r/964934 (owner: 10DCausse) [16:52:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [16:53:19] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:36] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye [16:53:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye [16:55:19] jouncebot: nowandnext [16:55:19] No deployments scheduled for the next 0 hour(s) and 4 minute(s) [16:55:19] In 0 
hour(s) and 4 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1700) [16:55:24] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Simplify releases/environments config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965226 (owner: 10Jforrester) [16:55:26] (03PS1) 10JMeybohm: Add appserver, api and jobrunner SANs to mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/965227 (https://phabricator.wikimedia.org/T347544) [16:56:09] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:11] (03Merged) 10jenkins-bot: wikifunctions: Simplify releases/environments config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965226 (owner: 10Jforrester) [16:57:07] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:57:12] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:58:38] (03CR) 10JMeybohm: "Not sure if it's worth it to separate this by mw release. WDYT?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/965227 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1700) [17:00:29] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:19] (03PS2) 10JMeybohm: Add appserver, api and jobrunner SANs to mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/965227 (https://phabricator.wikimedia.org/T347544) [17:03:41] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3004.esams.wmnet with OS bookworm [17:03:51] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir3004.esams.wmnet with OS bookworm [17:05:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:10:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:12:05] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:15:37] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:16:25] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:22:16] er? [17:24:09] 2001:504:61:0:6:1374:0:1, ipv6.de-cix.dfw.us.as398196.cobaltridge.com. hm ok [17:24:41] hm what's up with the git_pull_charts alert? [17:26:11] hnowlan: there seem to be some local changes in /srv/deployment-charts blocking git pulls, and you have related-looking SAL entries, can you fix those? 
[17:26:50] taavi: agh, fixing [17:27:44] taavi: done, thanks for the heads-up [17:27:57] !log repool cp2030 for service=cdn [17:27:59] (CR) Bking: [C: +2] admin: Add cirrus-streaming-updater namespace to flink operator [deployment-charts] - https://gerrit.wikimedia.org/r/964567 (https://phabricator.wikimedia.org/T347075) (owner: Ebernhardson) [17:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:02] (PS1) Majavah: helpfile: Cleanup chart pull timer [puppet] - https://gerrit.wikimedia.org/r/965229 [17:28:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage [17:28:07] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:33] SRE-swift-storage, Commons, MediaWiki-Uploading, MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Yann) Four files repeatedly failed to upload today... [17:30:37] hnowlan: thanks, although now it looks `helmfile` is showing some diffs between the applied state and the values file [17:32:09] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage [17:35:36] SRE, DNS, Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (Lhiraide) Hi @NMariano-WMF that would be great! Thank you all so much for your help! [17:36:29] SRE-swift-storage, Commons, MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (Don-vip) The problem is worse today, I have now 5 files that have not been uploaded: - https://co...
[17:41:59] SRE, DNS, Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (NMariano-WMF) Hi @Lhiraide and @ssingh, I sent out an invite tomorrow. I don't think we'll need the full time for the meeting, but wanted to be safe just in case we did. Let me know if that time doesn't... [17:46:56] jouncebot: nowandnext [17:46:56] For the next 0 hour(s) and 13 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1700) [17:46:57] In 0 hour(s) and 13 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1800) [17:46:57] In 0 hour(s) and 13 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1800) [17:47:00] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:47:04] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:47:09] OK, good. [17:49:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:49:59] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:50:32] is logstash working okay for everyone? Searches aren't returning any results, and there's a lot of "Could not index event to OpenSearch. status: 400" [17:51:17] TheresNoTime: I just loaded the MW-NEW-errors dash OK. Is it a specific dash that's broken? Or a timerange?
[17:52:28] oh wait one.. [17:52:47] TheresNoTime: if you see "dlq-*" in top-left corner, change it to "logstash-*" [17:52:50] huh, okay, false alarm — the "index pattern" has changed.. [17:52:53] yeah [17:52:57] Aha. [17:53:04] i was also confused by that a few days ago [17:53:24] (CR) Jforrester: [C: +2] wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester) [17:54:21] (Merged) jenkins-bot: wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388) (owner: Jforrester) [17:54:59] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:55:40] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:55:49] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3004.esams.wmnet with OS bookworm [17:55:59] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir3004.esams.wmnet with OS bookworm completed: - ncredir3004 (**WARN**) - Downtimed on Ici... [17:56:02] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:56:13] SRE, DNS, Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (ssingh) @NMariano-WMF: Thanks, accepted!
[17:56:31] (03PS5) 10Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) [17:56:33] (03PS5) 10Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) [17:59:43] (03PS1) 10RLazarus: Revert "admin: Temporarily add a second ssh key for rzl" [puppet] - 10https://gerrit.wikimedia.org/r/965209 [18:00:04] hashar and jeena: May I have your attention please! Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1800) [18:00:04] hashar and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T1800). nyaa~ [18:00:48] (03CR) 10RLazarus: [C: 03+2] Revert "admin: Temporarily add a second ssh key for rzl" [puppet] - 10https://gerrit.wikimedia.org/r/965209 (owner: 10RLazarus) [18:01:01] (03PS1) 10Jforrester: wikifunctions: Define different ports for different service releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/965234 (https://phabricator.wikimedia.org/T343388) [18:02:34] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Define different ports for different service releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/965234 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester) [18:03:23] (03Merged) 10jenkins-bot: wikifunctions: Define different ports for different service releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/965234 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester) [18:04:44] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:05:26] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:06:32] 
(KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [18:07:06] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:07:38] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:07:42] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:08:10] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:14:02] (03PS6) 10Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) [18:14:04] (03PS6) 10Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) [18:14:06] (03PS1) 10Jforrester: wikifunctions: Move orchestrator config from chart to service values [deployment-charts] - 10https://gerrit.wikimedia.org/r/965237 (https://phabricator.wikimedia.org/T343388) [18:15:28] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:17:11] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Move orchestrator config from chart to service values [deployment-charts] - 10https://gerrit.wikimedia.org/r/965237 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester) [18:17:48] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [18:18:16] (03Merged) 
10jenkins-bot: wikifunctions: Move orchestrator config from chart to service values [deployment-charts] - 10https://gerrit.wikimedia.org/r/965237 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester) [18:18:47] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3003.esams.wmnet with OS bookworm [18:18:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52910 and previous config saved to /var/cache/conftool/dbconfig/20231011-181849-arnaudb.json [18:18:53] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:18:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir3003.esams.wmnet with OS bookworm [18:19:08] (03PS1) 10Jforrester: specials: Use correct title in NewPagesPager [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965211 (https://phabricator.wikimedia.org/T348665) [18:19:50] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stat1011.eqiad.wmnet with reason: host reimage [18:21:06] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:21:50] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:22:08] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:27] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:23:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat1011.eqiad.wmnet with reason: host reimage [18:23:17] !log jforrester@deploy2002 helmfile [eqiad] 
DONE helmfile.d/services/wikifunctions: apply [18:23:21] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:24:10] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:24:46] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:25:00] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:27:19] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester) [18:28:13] (03Merged) 10jenkins-bot: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester) [18:28:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:31:02] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:31:33] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:32:06] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:33:01] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:33:05] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:33:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:33:53] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:33:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P52911 and previous config saved to /var/cache/conftool/dbconfig/20231011-183355-arnaudb.json [18:34:32] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester) [18:35:22] (03Merged) 10jenkins-bot: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester) [18:35:46] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:35:49] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:36:01] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:36:04] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:36:07] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:36:09] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:43:06] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage [18:43:29] (03PS1) 10Jforrester: wikifunctions: Rev charts to 0.2.0, move TODOs around for clarity [deployment-charts] - 10https://gerrit.wikimedia.org/r/965239 [18:45:00] 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10M2k_dewiki) [18:46:15] 
!log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage [18:46:57] (03CR) 10Jforrester: [C: 04-1] "This is a much bigger diff than expected! To investigate." [deployment-charts] - 10https://gerrit.wikimedia.org/r/965239 (owner: 10Jforrester) [18:47:40] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) I lab tested this and the "always-compare-med" command works as expected (see P52912). >>! In T348446#9238640, @ayounsi wrote: > Some of our... [18:48:43] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:49:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P52913 and previous config saved to /var/cache/conftool/dbconfig/20231011-184902-arnaudb.json [18:49:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:49:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stat1011.eqiad.wmnet with OS bullseye [18:49:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye completed: -... 
[18:53:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) [18:53:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) 05Open→03Resolved [19:04:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T343198)', diff saved to https://phabricator.wikimedia.org/P52914 and previous config saved to /var/cache/conftool/dbconfig/20231011-190408-arnaudb.json [19:04:18] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [19:08:13] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1101'] [19:10:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3003.esams.wmnet with OS bookworm [19:10:23] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir3003.esams.wmnet with OS bookworm completed: - ncredir3003 (**PASS**) - Downtimed on Ici... 
[19:12:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [19:12:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir2002.codfw.wmnet with OS bookworm [19:12:29] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir2002.codfw.wmnet with OS bookworm [19:14:15] (03CR) 10Ebernhardson: [C: 03+2] rdf-streaming-updater: restrict space usage alert from 1TiB to 50GiB [alerts] - 10https://gerrit.wikimedia.org/r/964934 (owner: 10DCausse) [19:15:29] (03Merged) 10jenkins-bot: rdf-streaming-updater: restrict space usage alert from 1TiB to 50GiB [alerts] - 10https://gerrit.wikimedia.org/r/964934 (owner: 10DCausse) [19:23:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:27:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [19:32:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [19:37:10] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir2002.codfw.wmnet with reason: host reimage [19:40:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir2002.codfw.wmnet with reason: host reimage [19:43:26] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10Don-vip) Another 
one: - https://commons.wikimedia.org/wiki/File:2016GHRCUWGMeeting_(29211169404).... [19:44:14] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1102.mgmt.eqiad.wmnet with reboot policy FORCED [19:49:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:52:20] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1103.mgmt.eqiad.wmnet with reboot policy FORCED [19:54:38] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir2002.codfw.wmnet with OS bookworm [19:54:49] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir2002.codfw.wmnet with OS bookworm completed: - ncredir2002 (**WARN**) - Downtimed on Ici... [19:55:20] (03PS2) 10Samtar: Enable Edit Check on initial partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963084 (https://phabricator.wikimedia.org/T347908) (owner: 10DLynch) [19:56:47] Hi, is there some maintenance or something like that? Commons is throwing me "Failed to commit operations" when I'm using FileImporter for moving files from Serbian Wikipedia to Commons. [19:57:58] So far, the second file wasn't completely imported because of that error, so I had to ask on #wikimedia-commons for deletion. The files don't have many revisions, just about 5-6 revisions. [19:58:10] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:11] And 3 previous versions of images. 
[20:00:01] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1104'] [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T2000) [20:00:05] kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1104'] [20:00:20] 👋🏻 [20:00:27] I can deploy :) [20:00:37] Kizule: could you log a task? [20:00:52] Kemayo: starting with 963084 [20:00:56] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963084 (https://phabricator.wikimedia.org/T347908) (owner: 10DLynch) [20:01:30] TheresNoTime: Sounds good [20:01:56] (03Merged) 10jenkins-bot: Enable Edit Check on initial partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963084 (https://phabricator.wikimedia.org/T347908) (owner: 10DLynch) [20:02:25] !log samtar@deploy2002 Started scap: Backport for [[gerrit:963084|Enable Edit Check on initial partner wikis (T347908)]] [20:02:39] T347908: [Config] Enable Edit Check (References) at initial partner wikis - https://phabricator.wikimedia.org/T347908 [20:02:50] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:02:52] !log bking@deploy2002 helmfile [staging-eqiad] START 
helmfile.d/admin 'apply'. [20:03:16] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [20:03:20] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [20:03:47] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [20:03:49] !log samtar@deploy2002 samtar and kemayo: Backport for [[gerrit:963084|Enable Edit Check on initial partner wikis (T347908)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:03:51] Kizule: I did a little bit of digging, maybe T348688 [20:03:58] Kemayo: live on mwdebug, can you test? :) [20:04:01] !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:04:04] T348688: FileBackendStore::ingestFreshFileStats: Could not stat file - https://phabricator.wikimedia.org/T348688 [20:04:08] !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:04:24] TheresNoTime: It seems to be working fine, thanks! [20:04:30] !log samtar@deploy2002 samtar and kemayo: Continuing with sync [20:04:45] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir2001.codfw.wmnet with OS bookworm [20:04:47] !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [20:04:53] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[20:04:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir2001.codfw.wmnet with OS bookworm [20:05:14] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [20:05:20] (03PS2) 10Samtar: Remove override to allow mobile edit notices to display on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965205 (https://phabricator.wikimedia.org/T316178) (owner: 10DLynch) [20:07:35] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:07:38] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:08:00] (03PS1) 10Jdlrobson: Beta cluster: mobile web click tracking schema at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965246 (https://phabricator.wikimedia.org/T346106) [20:09:58] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:963084|Enable Edit Check on initial partner wikis (T347908)]] (duration: 07m 32s) [20:10:05] T347908: [Config] Enable Edit Check (References) at initial partner wikis - https://phabricator.wikimedia.org/T347908 [20:10:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965205 (https://phabricator.wikimedia.org/T316178) (owner: 10DLynch) [20:11:14] (03Merged) 10jenkins-bot: Remove override to allow mobile edit notices to display on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965205 (https://phabricator.wikimedia.org/T316178) (owner: 10DLynch) [20:11:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [20:11:27] (03PS3) 10Samtar: InitialiseSettings-labs: Enable UrlShortenerEnableQrCode on all of beta 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/965240 (https://phabricator.wikimedia.org/T348487) [20:11:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [20:11:39] !log samtar@deploy2002 Started scap: Backport for [[gerrit:965205|Remove override to allow mobile edit notices to display on all wikis (T316178)]] [20:11:44] T316178: [Config Change] Make upstream mobile edit notice implementation available at all wikis - https://phabricator.wikimedia.org/T316178 [20:12:11] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye [20:12:17] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with e... [20:13:00] !log samtar@deploy2002 kemayo and samtar: Backport for [[gerrit:965205|Remove override to allow mobile edit notices to display on all wikis (T316178)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:04] Kemayo: second patch live on mwdebug [20:13:24] Checking now. 
[20:13:44] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:13:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [20:13:55] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [20:14:27] TheresNoTime: Okay, looks good to deploy. [20:14:33] !log samtar@deploy2002 kemayo and samtar: Continuing with sync [20:16:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:19:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:57] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:965205|Remove override to allow mobile edit notices to display on all wikis (T316178)]] (duration: 08m 18s) [20:20:02] Kemayo: both live in prod :) [20:20:02] T316178: [Config Change] Make upstream mobile edit notice implementation available at all wikis - https://phabricator.wikimedia.org/T316178 [20:20:09] TheresNoTime: great, thanks! 
[20:21:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:21:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965240 (https://phabricator.wikimedia.org/T348487) (owner: 10Samtar) [20:22:04] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir2001.codfw.wmnet with reason: host reimage [20:22:33] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Enable UrlShortenerEnableQrCode on all of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965240 (https://phabricator.wikimedia.org/T348487) (owner: 10Samtar) [20:24:21] (03PS1) 10Bking: flink-zk: Permit traffic from STAGING_KUBEPODS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/965248 (https://phabricator.wikimedia.org/T347075) [20:24:48] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir2001.codfw.wmnet with reason: host reimage [20:25:03] (03CR) 10Ebernhardson: [C: 03+1] flink-zk: Permit traffic from STAGING_KUBEPODS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/965248 (https://phabricator.wikimedia.org/T347075) (owner: 10Bking) [20:26:34] (03CR) 10Bking: [C: 03+2] flink-zk: Permit traffic from STAGING_KUBEPODS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/965248 (https://phabricator.wikimedia.org/T347075) (owner: 10Bking) [20:36:46] (03PS1) 10Ebernhardson: cirrus-streaming-updater: Correctly define the entry class [deployment-charts] - 10https://gerrit.wikimedia.org/r/965249 (https://phabricator.wikimedia.org/T347075) [20:38:16] (03CR) 10Bking: [C: 03+2] cirrus-streaming-updater: Correctly define the entry 
class [deployment-charts] - 10https://gerrit.wikimedia.org/r/965249 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [20:39:31] (03CR) 10Bking: [C: 03+2] cirrus-streaming-updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/964928 (owner: 10Ebernhardson) [20:39:41] RECOVERY - MD RAID on ganeti1022 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:40:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir2001.codfw.wmnet with OS bookworm [20:40:16] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir2001.codfw.wmnet with OS bookworm completed: - ncredir2001 (**WARN**) - Downtimed on Ici... [20:40:20] (03Merged) 10jenkins-bot: cirrus-streaming-updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/964928 (owner: 10Ebernhardson) [20:40:22] (03Merged) 10jenkins-bot: cirrus-streaming-updater: Correctly define the entry class [deployment-charts] - 10https://gerrit.wikimedia.org/r/965249 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [20:41:07] TheresNoTime: Sorry for not responding earlier. Can you check Logstash for Aquaman and the Lost Kingdom logo.jpg on Serbian Wikipedia? I just tried to delete it, and I got a generic error that deleting isn't possible because of local-swift-eqiad? 
[20:43:42] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:44:06] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:45:40] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:45:48] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:49:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:53:47] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [20:54:08] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1002.eqiad.wmnet with OS bookworm [20:54:19] jouncebot: nowandnext [20:54:19] For the next 0 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T2000) [20:54:19] In 0 hour(s) and 5 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T2100) [20:54:19] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir1002.eqiad.wmnet with OS bookworm [20:54:33] (03PS1) 10Majavah: Set WRITE_NEW for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965250 (https://phabricator.wikimedia.org/T242031) [20:54:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965250 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [20:55:30] (03Merged) 10jenkins-bot: Set WRITE_NEW for CA wikis on OATHAuth multiple devices 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/965250 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [20:55:53] !log taavi@deploy2002 Started scap: Backport for [[gerrit:965250|Set WRITE_NEW for CA wikis on OATHAuth multiple devices (T242031)]] [20:55:58] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [20:57:13] !log taavi@deploy2002 taavi: Backport for [[gerrit:965250|Set WRITE_NEW for CA wikis on OATHAuth multiple devices (T242031)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231011T2100) [21:01:05] !log taavi@deploy2002 taavi: Continuing with sync [21:04:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:04:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:27] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:965250|Set WRITE_NEW for CA wikis on OATHAuth multiple devices (T242031)]] (duration: 10m 33s) [21:06:38] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [21:07:17] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage [21:09:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage [21:11:12] !log T348418 Rebooting `apifeatureusage1001.eqiad.wmnet` [21:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[21:11:16] T348418: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418 [21:15:42] (SystemdUnitFailed) firing: (2) ifup@ens13.service Failed on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:29] ^ Had set a downtime on icinga but not alertmanager. The apifeatureusage1001 alert should resolve soon with the host back online [21:20:42] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on apifeatureusage2001.codfw.wmnet with reason: reboot T348418 [21:20:42] (SystemdUnitFailed) resolved: (2) ifup@ens13.service Failed on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:20:44] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on apifeatureusage2001.codfw.wmnet with reason: reboot T348418 [21:20:46] T348418: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418 [21:23:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:26:04] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir1002.eqiad.wmnet with OS bookworm [21:26:13] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir1002.eqiad.wmnet with OS bookworm completed: - ncredir1002 (**WARN**) - Downtimed on Ici... [21:30:45] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1001.eqiad.wmnet with OS bookworm [21:30:50] !log ryankemper@cumin1001 START - Cookbook sre.hosts.remove-downtime for apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet [21:30:50] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet [21:30:57] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir1001.eqiad.wmnet with OS bookworm [21:38:52] 10SRE, 10Growth-Team, 10MW-on-K8s, 10MediaWiki-Platform-Team, and 5 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (10KStoller-WMF) [21:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [21:43:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:47:02] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage [21:48:59] (03PS3) 10Bking: wdqs: bring graph split hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505) 
[21:49:01] (03PS13) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [21:49:37] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage [21:51:40] (03CR) 10CI reject: [V: 04-1] wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [21:58:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:02:08] (03PS1) 10Ebernhardson: cirrus-streaming-updater: Enable s3 for state storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/965256 (https://phabricator.wikimedia.org/T347075) [22:02:50] (03PS2) 10Ebernhardson: cirrus-streaming-updater: Enable s3 for state storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/965256 (https://phabricator.wikimedia.org/T347075) [22:05:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:05:21] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir1001.eqiad.wmnet with OS bookworm [22:05:32] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir1001.eqiad.wmnet with OS bookworm completed: - ncredir1001 (**WARN**) - Downtimed on Ici... 
[22:06:12] (03PS14) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [22:06:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [22:06:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [22:06:58] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [22:08:28] (03CR) 10Ebernhardson: [C: 03+2] cirrus-streaming-updater: Enable s3 for state storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/965256 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [22:09:11] (03Merged) 10jenkins-bot: cirrus-streaming-updater: Enable s3 for state storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/965256 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [22:10:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [22:12:23] (03PS15) 10Bking: wdqs: Set up graph_split hosts [puppet] - 
10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [22:12:45] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [22:13:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:15:31] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:15:42] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:18:02] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:18:12] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:18:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:46:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [22:47:02] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [23:05:41] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-be2003.codfw.wmnet with OS bullseye [23:05:48] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - 
https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with e... [23:05:51] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Papaul) @MatthewVernon sorry to hear that you are having some issue with this server. I was able to set all the disks as JBOD like you asked. However... [23:06:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [23:09:05] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [23:09:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [23:22:25] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye [23:23:47] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye [23:23:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [23:41:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:46:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:59:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state