[00:22:07] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:06] (03CR) 10Cwhite: "Looking at the full diff, there appears to be a set of quantiles configured as well. Probably don't need those." [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [00:29:53] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/966881 (owner: 10Herron) [00:30:13] (03CR) 10Cwhite: [C: 03+1] thanos: allow thanos-rule to serve /rule [puppet] - 10https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi) [00:30:45] (03CR) 10Cwhite: [C: 03+1] thanos: reverse-proxy /rule to rule-hosts [puppet] - 10https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi) [00:33:03] (03CR) 10Cwhite: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [00:33:07] (03CR) 10Cwhite: [C: 03+1] pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [00:33:20] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10tstarling) Since the cause is not the same old failure of the prox... 
[00:39:00] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966829 [00:39:06] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966829 (owner: 10TrainBranchBot) [00:53:43] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966829 (owner: 10TrainBranchBot) [00:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:57:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [01:07:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [01:17:15] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [01:46:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [01:57:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [02:04:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:05:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:06:23] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:07:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:08:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:08:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:12:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [02:30:15] (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:38:37] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:37] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:51:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [03:56:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [04:14:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [04:19:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [04:20:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:20:31] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:12:12] (03CR) 10Santhosh: Update cxserver to 2023-10-12-080927-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [05:12:57] !log tchin@deploy2002 Started deploy [airflow-dags/analytics@60950f6]: Deploying airflow [data-engineering/airflow-dags@60950f6b] [05:14:01] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [05:14:07] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [05:14:09] !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@60950f6]: Deploying airflow [data-engineering/airflow-dags@60950f6b] (duration: 01m 12s) [05:21:07] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [05:22:37] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [05:24:41] (03Abandoned) 10Tim Starling: sshd: Disable keyboard-interactive authentication [puppet] - 10https://gerrit.wikimedia.org/r/956983 (owner: 10Tim Starling) [05:35:08] (03PS3) 10Tim Starling: Add LoginNotify cron job [puppet] - 10https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989) [05:36:19] (03PS2) 10Tim Starling: Enable LoginNotify seen subnets table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 
(https://phabricator.wikimedia.org/T346989) [05:42:32] (03PS1) 10Marostegui: install_server: Do not reimage db1231 [puppet] - 10https://gerrit.wikimedia.org/r/966943 [05:43:08] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1231 [puppet] - 10https://gerrit.wikimedia.org/r/966943 (owner: 10Marostegui) [05:48:55] (03CR) 10Tim Starling: "I went for an 80 day expiry with 8 day buckets after reviewing https://foundation.wikimedia.org/wiki/Legal:Data_retention_guidelines . May" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [05:52:24] (03PS1) 10Tim Starling: Enable source maps everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966945 (https://phabricator.wikimedia.org/T47514) [05:58:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T0600) [06:00:06] kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T0600). [06:03:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [06:18:23] (03CR) 10Volans: [C: 03+2] documentation: expand distributed locking docs [software/spicerack] - 10https://gerrit.wikimedia.org/r/966886 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [06:18:51] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [06:19:17] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:19:23] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:25:01] (03Merged) 10jenkins-bot: documentation: expand distributed locking docs [software/spicerack] - 10https://gerrit.wikimedia.org/r/966886 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [06:28:01] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 
[06:28:02] (03CR) 10Volans: [C: 03+2] "Thanks John for fixing the tests for me!" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [06:29:23] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:30:15] (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:31:16] !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet [06:31:28] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet [06:32:29] !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet [06:32:57] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet [06:34:45] !log enabled distributed locking support in spicerack/cookbooks T341973 [06:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:49] T341973: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 [06:38:34] (03CR) 10Slyngshede: [C: 03+2] puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [06:40:21] (03Merged) 10jenkins-bot: puppet_agent_failed: label alert with appropriate team. 
[alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [06:43:32] (03PS9) 10Brouberol: Publish metrics reflecting skein certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) [06:43:39] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10Volans) Distributed locking is now live in Spicerack and used by the Cookbooks. For a general overview see https://doc.wikimedia.org/spicerack/master/introduction.h... [06:43:50] (03CR) 10Brouberol: Publish metrics reflecting skein certificate expiry (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [06:49:09] (03CR) 10Elukey: ml-services: deploy nllb in llm namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos) [06:49:59] (PuppetFailure) resolved: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:50:11] (03CR) 10Elukey: ml-services: deploy nllb in llm namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos) [06:51:11] (03CR) 10Elukey: [C: 03+1] Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [06:52:18] (03CR) 10Elukey: "Revoking the +1, I am trying to check one thing" [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [06:56:19] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: allow thanos-rule to serve /rule [puppet] - 
10https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi) [06:56:28] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: reverse-proxy /rule to rule-hosts [puppet] - 10https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi) [06:57:52] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [06:58:09] (03CR) 10Elukey: "So I checked https://puppet-compiler.wmflabs.org/output/966497/2522/kafka-jumbo1010.eqiad.wmnet/index.html and this will happen to all the" [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [07:00:05] Amir1, apergos, and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T0700). [07:00:06] tgr: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:20] out with covid, cannot manage the window [07:01:31] ouch take care apergos! 
[07:01:38] ty [07:01:51] (03PS1) 10Slyngshede: puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 [07:03:37] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:03:39] Get well soon apergos [07:06:39] (03PS1) 10Filippo Giunchedi: thanos: set external-prefix for rule [puppet] - 10https://gerrit.wikimedia.org/r/967129 (https://phabricator.wikimedia.org/T349102) [07:07:38] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: set external-prefix for rule [puppet] - 10https://gerrit.wikimedia.org/r/967129 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi) [07:08:01] (03CR) 10Slyngshede: "I feel like this would be handy, but let me know if there's a reason as to why we shouldn't include labels in summaries." [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede) [07:13:36] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [07:14:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [07:14:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965879 (https://phabricator.wikimedia.org/T342475) (owner: 10Gergő Tisza) [07:15:07] (03Merged) 10jenkins-bot: [beta] Make temp user config SUL-friendly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965879 (https://phabricator.wikimedia.org/T342475) (owner: 10Gergő Tisza) [07:16:17] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [07:16:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, also something to keep in mind that when alerts get aggregated in groups the number of alerts will show up on irc, though only one i" [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede) [07:17:25] !log UTC morning deploys done [07:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:03] (03PS8) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) [07:20:26] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db2109.codfw.wmnet with reason: db2109 downtime while repooling [07:20:28] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db2109.codfw.wmnet with reason: db2109 downtime while repooling [07:21:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/60/cons" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [07:23:39] (03CR) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [07:33:56] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [07:41:31] (03CR) 10Brouberol: [C: 03+2] Publish metrics reflecting skein certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [07:43:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "I don't think we should be spending time on graphite, having said that feel free to try" [puppet] - 10https://gerrit.wikimedia.org/r/966881 (owner: 10Herron) [07:43:27] (03PS2) 10Ilias Sarantopoulos: ml-services: deploy nllb in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) [07:43:44] (03CR) 10Ilias Sarantopoulos: ml-services: deploy nllb in llm namespace (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos) [07:45:06] (03CR) 10Elukey: [C: 03+1] ml-services: deploy nllb in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos) [07:47:23] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:54:02] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:monitoring absent Incinga check_eth. 
[puppet] - 10https://gerrit.wikimedia.org/r/966535 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [07:56:40] (03CR) 10Vgutierrez: [C: 03+2] hieradata: Deploy digicert-2023 unified cert [puppet] - 10https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez) [07:59:45] (03CR) 10Filippo Giunchedi: [C: 03+1] pyrra: add logstash requests slo [puppet] - 10https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [08:00:05] brennen and hashar: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T0800). [08:00:07] (03CR) 10Filippo Giunchedi: [C: 03+1] pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [08:06:59] (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:07:29] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:10:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:12:17] PROBLEM - Check systemd state on db2132 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-mysqld-exporter.service,wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:21:56] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy nllb in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos) [08:22:48] (03Merged) 10jenkins-bot: ml-services: deploy nllb in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos) [08:28:51] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:33:34] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:34:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:35:23] (03CR) 10Ayounsi: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (0312 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/966904 
(https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney) [08:36:11] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [08:41:20] (03CR) 10Effie Mouzeli: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan) [08:45:12] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:15] (03CR) 10Kosta Harlan: [C: 04-2] [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan) [08:46:32] (03PS1) 10Elukey: profile::prometheus::k8s: drop unused Istio labels [puppet] - 10https://gerrit.wikimedia.org/r/967140 (https://phabricator.wikimedia.org/T349072) [08:51:09] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/71/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967140 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey) [08:51:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:58:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [09:02:34] 10SRE, 10Maps: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (10Nicolas_Raoul) [09:03:13] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:29] (03CR) 10Jbond: [C: 04-1] [BETA HACK] Attempt to secure Puppet DB better (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941476 (owner: 10Krinkle) [09:05:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede) [09:09:54] (03PS1) 10Filippo Giunchedi: sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) [09:11:06] (03CR) 10CI reject: [V: 04-1] sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [09:12:25] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:13:07] (03CR) 10Jbond: "lgtm nit/question inline" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [09:14:03] (03PS2) 10Filippo Giunchedi: sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 
(https://phabricator.wikimedia.org/T345712) [09:14:05] (03PS1) 10Filippo Giunchedi: test: expand 'runbook not found' assertion [alerts] - 10https://gerrit.wikimedia.org/r/967144 [09:15:09] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:48] (03CR) 10CI reject: [V: 04-1] test: expand 'runbook not found' assertion [alerts] - 10https://gerrit.wikimedia.org/r/967144 (owner: 10Filippo Giunchedi) [09:15:54] (03CR) 10CI reject: [V: 04-1] sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [09:18:10] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [09:20:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:20:57] (03PS1) 10Kevin Bazira: ml-services: update rec-api-ng resource limits to match wmflabs [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) [09:30:51] (03CR) 10Fabfur: [C: 04-1] haproxy: enable healthcheck-dedicated backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [09:32:08] (03CR) 10Vgutierrez: [C: 04-1] haproxy: enable healthcheck-dedicated backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [09:32:18] (03PS1) 10Jelto: miscweb: remove the use 
of :latest image tag in httpd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856) [09:32:48] (03PS6) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [09:35:13] (03CR) 10Vgutierrez: [C: 04-1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/966221/comments/9b08a162_1d01cf19 is still pending" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [09:37:56] (03CR) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from the inventory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [09:38:21] (03CR) 10Elukey: [V: 03+1 C: 04-1] profile::prometheus::k8s: drop unused Istio labels [puppet] - 10https://gerrit.wikimedia.org/r/967140 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey) [09:40:26] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [09:43:21] (03CR) 10Brouberol: [C: 03+1] Create a new role for analytics_cluster::mariadb and assign it [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [09:49:12] (03CR) 10Effie Mouzeli: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan) [09:55:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:00:05] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1000). [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1000) [10:00:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:00:41] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:25] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:02:03] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:34] (03PS1) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967148 (https://phabricator.wikimedia.org/T236373) [10:07:36] (03PS1) 10Jbond: puppet: simplify debug code [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967149 [10:07:38] (03PS1) 10Jbond: tox: remove envdir optimizations [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967150 (https://phabricator.wikimedia.org/T348434) [10:07:40] (03PS1) 10Jbond: tox: 
add commands to allowlist_externals [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967151 [10:07:42] (03PS1) 10Jbond: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967152 [10:07:44] (03PS1) 10Jbond: debug_presentation: script to render HTML templates [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967153 [10:07:46] (03PS1) 10Jbond: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967154 [10:07:48] (03PS1) 10Jbond: Add style to HTML output [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967155 [10:11:47] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:13:06] (03PS2) 10Jbond: puppet: simplify debug code [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967149 [10:13:08] (03PS2) 10Jbond: tox: remove envdir optimizations [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967150 (https://phabricator.wikimedia.org/T348434) [10:13:10] (03PS2) 10Jbond: tox: add commands to allowlist_externals [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967151 [10:13:12] (03PS2) 10Jbond: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967152 [10:13:14] (03PS2) 10Jbond: debug_presentation: script to render HTML templates [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967153 [10:13:16] (03PS2) 10Jbond: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967154 [10:13:18] (03PS2) 10Jbond: Add style to HTML output [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967155 [10:16:57] (03PS9) 10Slyngshede: C:prometheus::ethtool_export 
Add ethtool exporter. [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) [10:18:21] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/80/cons" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [10:19:07] (03PS1) 10Ilias Sarantopoulos: ml-services: fix nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/967156 (https://phabricator.wikimedia.org/T349163) [10:20:37] (03CR) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [10:23:10] (03PS1) 10Jbond: html: update html to wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157 [10:25:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [10:26:27] (03CR) 10Elukey: [C: 03+1] ml-services: fix nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/967156 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos) [10:27:16] (03PS2) 10Jbond: html: update html to wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157 [10:33:54] (03CR) 10Elukey: ml-services: update rec-api-ng resource limits to match wmflabs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [10:41:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:43:33] (03PS3) 10Jbond: Add style to HTML output [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967155 [10:43:35] (03PS3) 10Jbond: html: update html to wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157 [10:51:05] (03CR) 10Jbond: [C: 03+2] puppet: simplify debug code [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967149 (owner: 10Jbond) [10:51:08] (03CR) 10Jbond: [C: 03+2] tox: remove envdir optimizations [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967150 (https://phabricator.wikimedia.org/T348434) (owner: 10Jbond) [10:51:12] (03CR) 10Jbond: [C: 03+2] tox: add commands to allowlist_externals [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967151 (owner: 10Jbond) [10:51:15] (03CR) 10Jbond: [C: 03+2] debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967152 (owner: 10Jbond) [10:51:20] (03CR) 10Jbond: [C: 03+2] debug_presentation: script to render HTML templates [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967153 (owner: 10Jbond) [10:51:24] (03CR) 10Jbond: [C: 03+2] Use macros for links to Gerrit and Jenkins [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967154 (owner: 10Jbond) [10:51:27] (03CR) 10Jbond: [C: 03+2] Add style to HTML output [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967155 (owner: 10Jbond) [10:51:33] (03CR) 10Kevin Bazira: ml-services: update rec-api-ng resource limits to match wmflabs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [10:51:35] (03CR) 10Jbond: [C: 03+2] html: update html to 
wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157 (owner: 10Jbond) [10:53:14] (03Merged) 10jenkins-bot: puppet: simplify debug code [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967149 (owner: 10Jbond) [10:54:46] (03Merged) 10jenkins-bot: tox: remove envdir optimizations [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967150 (https://phabricator.wikimedia.org/T348434) (owner: 10Jbond) [10:54:48] (03CR) 10Kosta Harlan: [C: 04-2] [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan) [10:54:51] (03Abandoned) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan) [10:54:54] (03PS1) 10Jbond: 2.6.0: prepare release [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967162 [10:54:58] (03Merged) 10jenkins-bot: tox: add commands to allowlist_externals [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967151 (owner: 10Jbond) [10:55:12] (03CR) 10Jbond: [C: 03+2] 2.6.0: prepare release [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967162 (owner: 10Jbond) [10:55:33] (03Merged) 10jenkins-bot: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967152 (owner: 10Jbond) [10:55:35] (03Merged) 10jenkins-bot: debug_presentation: script to render HTML templates [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967153 (owner: 10Jbond) [10:55:37] (03Merged) 10jenkins-bot: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967154 (owner: 10Jbond) [10:55:39] (03Merged) 10jenkins-bot: Add style to HTML output [software/puppet-compiler] (2.x) - 
10https://gerrit.wikimedia.org/r/967155 (owner: 10Jbond) [10:55:41] (03Merged) 10jenkins-bot: html: update html to wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157 (owner: 10Jbond) [10:56:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:58:49] (03Merged) 10jenkins-bot: 2.6.0: prepare release [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967162 (owner: 10Jbond) [11:01:36] (03PS1) 10Jbond: Merge branch '2.x' [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967163 [11:02:35] (03PS1) 10Volans: sre.deploy.python-code: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967164 (https://phabricator.wikimedia.org/T341973) [11:02:37] (03PS1) 10Volans: sre.discovery.datacenter: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) [11:02:39] (03PS1) 10Volans: sre.discovery.service-route: customize lock args [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) [11:02:41] (03PS1) 10Volans: sre.dns.netbox: make the lock exclusive [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) [11:03:37] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:04:01] (03CR) 10Jbond: "I ended up pointing theses at the 2.x branch in" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879 (owner: 10Hashar) [11:04:19] PROBLEM - Check systemd state on 
an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:27] (03Abandoned) 10Jbond: tox: remove envdir optimizations [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434) (owner: 10Hashar) [11:04:35] (03Abandoned) 10Jbond: tox: add commands to allowlist_externals [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966895 (owner: 10Hashar) [11:04:41] (03Abandoned) 10Jbond: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966822 (owner: 10Hashar) [11:04:47] (03Abandoned) 10Jbond: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar) [11:04:54] (03Abandoned) 10Jbond: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966878 (owner: 10Hashar) [11:04:59] (03Abandoned) 10Jbond: Add style to HTML output [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879 (owner: 10Hashar) [11:05:17] (03CR) 10Jbond: [C: 03+2] Merge branch '2.x' [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967163 (owner: 10Jbond) [11:08:10] (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/967170 [11:08:19] (03CR) 10Volans: sre.discovery.datacenter: customize lock arguments (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:08:40] (03Merged) 10jenkins-bot: Merge branch '2.x' [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967163 (owner: 10Jbond) [11:08:47] (03CR) 10Volans: [C: 04-1] "This is just a proposal" [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:09:15] (03CR) 10Volans: 
"LMK if you think this is too strict" [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:09:24] 10SRE, 10Traffic: HAProxy should use a single backend for Vanish - https://phabricator.wikimedia.org/T349287 (10Fabfur) [11:09:35] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/967170 (owner: 10Jbond) [11:10:22] (03PS2) 10Filippo Giunchedi: test: deal with private runbooks [alerts] - 10https://gerrit.wikimedia.org/r/967144 [11:10:24] (03PS3) 10Filippo Giunchedi: sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) [11:11:34] (03CR) 10CI reject: [V: 04-1] test: deal with private runbooks [alerts] - 10https://gerrit.wikimedia.org/r/967144 (owner: 10Filippo Giunchedi) [11:12:16] (03CR) 10CI reject: [V: 04-1] sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [11:13:07] (03PS7) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [11:13:09] (03PS1) 10Fabfur: haproxy: remove multiple backends choice [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) [11:13:26] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:44] (03PS3) 10Filippo Giunchedi: test: deal with private runbooks [alerts] - 10https://gerrit.wikimedia.org/r/967144 [11:15:46] (03PS4) 10Filippo Giunchedi: sre: first iteration for otel-coll alerts [alerts] - 
10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) [11:15:49] (03PS8) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [11:19:38] (03PS1) 10Jelto: kubernetes::deployment_server: add common_image for httpd exporter [puppet] - 10https://gerrit.wikimedia.org/r/967174 (https://phabricator.wikimedia.org/T348856) [11:19:49] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. Nice one :)" [alerts] - 10https://gerrit.wikimedia.org/r/967144 (owner: 10Filippo Giunchedi) [11:21:31] (03CR) 10Jbond: [C: 03+2] compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:23:55] RECOVERY - Check systemd state on db2132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:11] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: fix nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/967156 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos) [11:25:04] (03Merged) 10jenkins-bot: ml-services: fix nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/967156 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos) [11:25:24] (03CR) 10Vgutierrez: [C: 04-1] "looking good" [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur) [11:25:44] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:29:19] (03CR) 10Ilias Sarantopoulos: "My suggestion is to limit the number of results that rec api is processing. 
Instead of 500 we can fetch 250/200/100 results/candidates as " [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [11:30:07] (03PS2) 10Fabfur: haproxy: remove multiple backends choice [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) [11:30:09] (03PS9) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [11:30:11] (03CR) 10Btullis: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 28): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/82/console" [puppet] - 10https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:30:50] (03CR) 10Fabfur: haproxy: remove multiple backends choice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur) [11:30:54] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[11:31:13] (03CR) 10Fabfur: haproxy: enable healthcheck-dedicated backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [11:32:19] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/967164 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:32:59] (03PS2) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) [11:33:25] (03CR) 10Jbond: [C: 03+1] "lgtm, left the open questions to service ops but FTR 1 seems like the correct concurrency to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:35:59] (03CR) 10Jbond: sre.discovery.service-route: customize lock args (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:36:01] (03CR) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (0312 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney) [11:36:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:38:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [11:40:24] (03CR) 10Ayounsi: [C: 03+1] sre.dns.netbox: make the lock exclusive [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:43:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [11:44:59] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@6f09297] (releasing): (no justification provided) [11:46:07] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@6f09297] (releasing): (no justification provided) (duration: 01m 08s) [11:47:28] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [11:52:40] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) [11:53:02] (03PS1) 10Jbond: elasticsearch::relforge: remove trailing comma [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291) [11:53:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [11:53:48] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) [11:54:06] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): re-write compile_redirects function - https://phabricator.wikimedia.org/T348883 (10jbond) [11:54:08] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) [11:54:14] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [11:54:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) [11:54:39] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): re-write compile_redirects function - https://phabricator.wikimedia.org/T348883 (10jbond) 05Open→03Stalled p:05Triage→03Medium [11:54:42] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [11:54:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/83/console" [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [11:55:08] 10SRE, 10Infrastructure-Foundations, 
10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) 05Open→03In progress p:05Triage→03Medium [11:55:46] (03CR) 10Jbond: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [12:00:07] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1200) [12:06:59] (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:17:45] (03PS19) 10Brouberol: Define environment variables to ease the use of prometheus-metricsfetcher [puppet] - 10https://gerrit.wikimedia.org/r/967134 [12:20:00] (03CR) 10Btullis: [C: 03+1] Remove kafka-jumbo100[1-6] brokers from the inventory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:20:33] (03CR) 10Brouberol: [C: 03+2] Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:23:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:24:37] (03CR) 10Volans: [C: 03+2] sre.deploy.python-code: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967164 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:25:00] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: make the lock exclusive [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:27:26] (03CR) 10Ayounsi: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney) [12:27:31] PROBLEM - Check systemd state on kubernetes2015 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:28:36] (03Merged) 10jenkins-bot: sre.deploy.python-code: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967164 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:28:51] (03PS2) 10Volans: sre.dns.netbox: make the lock exclusive [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) [12:29:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:30:32] (03CR) 10Kevin Bazira: ml-services: update rec-api-ng resource limits to match wmflabs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [12:33:06] (03PS1) 10Bartosz Dziewoński: Update comment about EditAttemptStep instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 [12:34:00] (03PS5) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) [12:34:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:34:04] (03PS5) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) [12:34:13] PROBLEM - Check 
systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:55] (03CR) 10Bartosz Dziewoński: "@Phuedx You added the comment in I070d826f63dae9e882137fd3d9bb3a76f6622a50. To be honest, I don't really understand it – the values listed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 (owner: 10Bartosz Dziewoński) [12:37:07] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:18] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:13] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [12:41:19] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1074 days) https://wikitech.wikimedia.org/wiki/Logs [12:42:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:42:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:44:19] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:44:35] PROBLEM - Check whether ferm is 
active by checking the default input chain on kubernetes2015 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:44:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.729 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:46:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:50:39] !log volans@cumin1001 START - Cookbook sre.dns.netbox [12:50:39] !log volans@cumin2002 START - Cookbook sre.dns.netbox [12:50:45] !log volans@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [12:50:46] !log volans@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [12:50:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:52:06] !log volans@cumin1001 START - Cookbook sre.dns.netbox [12:52:55] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2045 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started 
correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:55:41] PROBLEM - Check systemd state on ms-be2044 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:09] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2044 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:59:25] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: noop - volans@cumin1001" [12:59:37] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/84/console" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [12:59:55] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:prometheus::ethtool_export Add ethtool exporter. [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1300) [13:00:05] dcausse and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:15] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: noop - volans@cumin1001" [13:00:15] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:00:28] o/ [13:00:42] o/ [13:02:35] (03Abandoned) 10Jforrester: jquery.tablesorter: Fix data-sort-type with numeric values [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965690 (https://phabricator.wikimedia.org/T348812) (owner: 10Jforrester) [13:03:52] * TheresNoTime is unable to deploy [13:05:23] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:13:27] (03PS2) 10Anzx: hiwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967213 (https://phabricator.wikimedia.org/T310961) [13:14:56] 10SRE-OnFire, 10Cloud-VPS, 10cloud-services-team, 10Sustainability (Incident Followup), 10User-dcaro: openstack: create a cookbook to inject commands to VMs via console at scale - https://phabricator.wikimedia.org/T347683 (10taavi) a:03taavi [13:15:11] (03PS1) 10Hashar: Add a json representation of the build [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214 [13:16:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [13:17:09] (03CR) 10Hashar: "The build.json is the counter part of the build index. 
Theorically I can then integrate those data into the Gerrit check tab so people can" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214 (owner: 10Hashar) [13:18:11] (03CR) 10Hashar: "I should ideally add tests to cover `presentation.json.Build`" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214 (owner: 10Hashar) [13:21:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [13:22:47] (03CR) 10Ottomata: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [13:23:34] (03PS6) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) [13:23:50] (03PS6) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) [13:24:07] (03CR) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [13:24:09] (03PS1) 10Btullis: Update the email address used for refine-test systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/967215 [13:24:11] (03CR) 10Slyngshede: [C: 03+2] puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede) [13:25:23] (03CR) 10CI reject: [V: 04-1] puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede) [13:25:38] (03CR) 
10Slyngshede: [C: 03+2] puppet-agent-fail: enable check for all clusters. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede) [13:25:48] (03PS2) 10Btullis: Update the email address used for refine-test systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/967215 [13:27:16] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/85/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967215 (owner: 10Btullis) [13:30:56] I could deploy \o @ dcausse & aanzx although I got a bit rusty in it [13:30:57] (03CR) 10Jbond: "this looks fine, minor nit in line. could you also target 2.x and update the changelog" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214 (owner: 10Hashar) [13:31:29] @ aanzx I'm missing a +1 from another reviewer on your patches though [13:31:31] WMDE-Fisch: thanks! I have a meeting in 5min so it's fine to skip mine [13:31:37] kk [13:34:47] (03PS3) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) [13:34:53] (03PS3) 10Btullis: Update the email address used for refine-test systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/967215 [13:35:18] (03CR) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney) [13:35:30] (03CR) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney) [13:35:32] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Cleanup Kartographer Nearby flags (032 comments) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: 10WMDE-Fisch) [13:35:33] So I guess I skip aanzx patches as well. ... Only my patch left then. [13:35:41] (03CR) 10Filippo Giunchedi: [C: 03+2] test: deal with private runbooks [alerts] - 10https://gerrit.wikimedia.org/r/967144 (owner: 10Filippo Giunchedi) [13:36:13] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/86/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967215 (owner: 10Btullis) [13:36:30] (03CR) 10Ayounsi: [C: 03+1] "Awesome!" [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney) [13:36:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by wmde-fisch@deploy2002 using scap backport" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966610 (https://phabricator.wikimedia.org/T252346) (owner: 10WMDE-Fisch) [13:36:45] (03PS4) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) [13:37:11] (03PS2) 10Slyngshede: puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 [13:40:33] (03Merged) 10jenkins-bot: Revert "Revert "Workaround to center search terms label"" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966610 (https://phabricator.wikimedia.org/T252346) (owner: 10WMDE-Fisch) [13:41:00] !log wmde-fisch@deploy2002 Started scap: Backport for [[gerrit:966610|Revert "Revert "Workaround to center search terms label"" (T252346)]] [13:41:05] T252346: AdvancedSearch namespace pillbox label is misaligned - https://phabricator.wikimedia.org/T252346 [13:42:32] !log wmde-fisch@deploy2002 wmde-fisch: Backport for [[gerrit:966610|Revert "Revert "Workaround to center 
search terms label"" (T252346)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:43:15] !log wmde-fisch@deploy2002 wmde-fisch: Continuing with sync [13:43:43] (03CR) 10Ayounsi: [C: 03+1] Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney) [13:46:36] (03PS1) 10Kevin Bazira: ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966833 (https://phabricator.wikimedia.org/T347475) [13:47:45] WMDE-Fisch: hi, can I add one for beta labs? [13:47:56] Sure [13:48:00] kostajh: [13:48:03] WMDE-Fisch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/965100 [13:48:07] I'll add to the calendar now [13:48:26] (03PS3) 10WMDE-Fisch: labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) (owner: 10Kosta Harlan) [13:48:50] !log wmde-fisch@deploy2002 Finished scap: Backport for [[gerrit:966610|Revert "Revert "Workaround to center search terms label"" (T252346)]] (duration: 07m 50s) [13:48:55] T252346: AdvancedSearch namespace pillbox label is misaligned - https://phabricator.wikimedia.org/T252346 [13:49:23] WMDE-Fisch: Added [13:50:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by wmde-fisch@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) (owner: 10Kosta Harlan) [13:51:24] (03Merged) 10jenkins-bot: labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) (owner: 10Kosta Harlan) [13:52:06] (03PS1) 10Majavah: P:openstack: nova: add script to run console commands [puppet] - 10https://gerrit.wikimedia.org/r/967219 
(https://phabricator.wikimedia.org/T347683) [13:52:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [13:52:19] kostajh: Done, should be synced and working. [13:52:29] danke [13:52:42] (03CR) 10Vgutierrez: [C: 04-1] haproxy: remove multiple backends choice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur) [13:52:46] gerne :-) [13:53:13] I think it will need a few more minutes to sync to beta cluster https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ [13:53:50] Ah right. [13:54:34] (03CR) 10CI reject: [V: 04-1] P:openstack: nova: add script to run console commands [puppet] - 10https://gerrit.wikimedia.org/r/967219 (https://phabricator.wikimedia.org/T347683) (owner: 10Majavah) [13:55:15] (03PS2) 10Majavah: P:openstack: nova: add script to run console commands [puppet] - 10https://gerrit.wikimedia.org/r/967219 (https://phabricator.wikimedia.org/T347683) [13:56:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] "you will like this one ;)" [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [13:57:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [13:57:15] (03CR) 10Slyngshede: puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede) [13:58:28] (03Merged) 10jenkins-bot: puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede) [13:58:34] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:58:37] WMDE-Fisch: and yeah, it works now that the job has run. Thanks! [13:58:53] Perfect. I'm out of here \o :-) [13:59:43] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Implement JavaScript password match check. [software/bitu] - 10https://gerrit.wikimedia.org/r/966123 (owner: 10Slyngshede) [14:00:29] RECOVERY - Check systemd state on kubernetes2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:40] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - jclark@cumin1001" [14:01:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - jclark@cumin1001" [14:01:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:03:55] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:03:56] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:03:59] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host 
cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:05:07] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1007-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:09:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:11:16] (03Abandoned) 10Slyngshede: Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [14:12:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:12:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:14:54] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:14:58] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:15:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:15:39] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2015 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:15:54] (03CR) 10Jbond: [V: 03+1 C: 03+2] "i thought i had updated the commit message. 
however this one is caused because" [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [14:16:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:11] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:27] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:21:33] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:24:50] (03CR) 10Elukey: [C: 03+1] ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966833 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [14:28:25] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:29:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:29:50] (03CR) 10Ottomata: [C: 03+1] Update the email address used for refine-test systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/967215 (owner: 10Btullis) [14:31:13] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:31:16] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:31:18] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:32:08] (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review. 
:)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966833 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [14:32:22] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:33:13] (03Merged) 10jenkins-bot: ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966833 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [14:34:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet1007-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:34:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:35:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:35:35] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[14:38:24] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:38:38] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:49] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:39:09] (03CR) 10JMeybohm: [C: 03+1] kubernetes::deployment_server: add common_image for httpd exporter [puppet] - 10https://gerrit.wikimedia.org/r/967174 (https://phabricator.wikimedia.org/T348856) (owner: 10Jelto) [14:39:25] hmmm anybody working on thanos? [14:39:41] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:39:47] PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:39:48] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:39:51] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:40:05] PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:43:21] I think elukey is aware, titan1001 is indeed not in great shape [14:43:38] (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:06] I am yes, rebooting titan1001 vgutierrez [14:44:17] ack [14:44:41] !log powercycle titan1001 [14:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:59] (03CR) 10Kevin Bazira: ml-services: update rec-api-ng resource limits to match wmflabs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [14:49:11] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:49:17] RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:49:27] (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:33] RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Fri 03 Nov 2023 08:51:00 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:50:09] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:50:49] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:51:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:51:11] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:51:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:51:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:51:59] (03CR) 10Klausman: "This change is ready for review." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 (owner: 10Klausman) [14:52:16] (03PS2) 10Klausman: images: Add Go 1.21 image, based on bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 [14:53:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) [14:53:42] (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [14:54:31] (03PS1) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:55:19] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev'] [14:55:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:55:30] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1009-dev'] [14:55:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1009-dev'] [14:55:41] (03PS1) 10Ilias Sarantopoulos: ml-services: add autoscaling for langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/967230 (https://phabricator.wikimedia.org/T340507) [14:55:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev'] [14:55:52] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1009-dev'] [14:55:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1009-dev'] [14:56:17] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:56:23] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:23] PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:56:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev'] [14:56:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev'] [14:56:39] PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:57:24] wow again? 
[14:57:26] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet'] [14:58:06] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1009-dev.eqiad.wmnet'] [14:58:24] !log powercycle titan1001 [14:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:38] (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:27] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet'] [14:59:56] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet1008-dev.eqiad.wmnet'] [14:59:58] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet1007-dev.eqiad.wmnet'] [15:00:29] PROBLEM - Host titan1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:35] hmm [15:00:39] oh elukey ok [15:01:40] sukhe: yes yes it is me sorry :( [15:01:55] RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 7.105 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:01:56] np all good! 
was checking because I am on-call :) [15:01:57] RECOVERY - Host titan1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:02:02] (03CR) 10Ebernhardson: rdf-streaming-updater: update staging values (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:02:05] RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Fri 03 Nov 2023 08:51:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:03:05] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:03:19] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:03:36] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [15:03:38] (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Fabfur) [15:04:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet'] [15:04:45] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet'] [15:05:01] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:05:41] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts 
['cloudcontrol1010-dev.eqiad.wmnet'] [15:06:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet'] [15:07:34] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [15:07:39] PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:50] PROBLEM - Kafka Broker Server #page on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:07:50] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:07:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol1009-dev.eqiad.wmnet'] [15:08:11] elukey: is that you? 
sorry :) [15:08:16] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet'] [15:08:22] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet'] [15:08:26] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet'] [15:08:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet'] [15:08:32] !incidents [15:08:32] 4136 (ACKED) kafka-jumbo1001/Kafka Broker Server (paged) [15:08:34] sukhe: nono it is brouberol [15:08:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudnet1007-dev.eqiad.wmnet'] [15:08:37] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:08:37] denisse: ACKed [15:08:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudnet1008-dev.eqiad.wmnet'] [15:08:39] they are decomming old servers [15:08:45] thanks elukey and sorry :P [15:08:47] btullis, brouberol --^ [15:09:00] here, sorry, was making tea [15:09:17] PROBLEM - Check systemd state on kafka-jumbo1002 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:19] nothing is exploding, only a false alert due to missing downtime [15:09:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet1008-dev'] [15:09:22] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 
hosts ['cloudnet1007-dev'] [15:09:28] PROBLEM - Kafka Broker Server #page on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:09:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudnet1007-dev'] [15:09:29] ah, thanks, I'll go back to my tea [15:09:29] sukhe: it's related to the decommission, right?? [15:09:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudnet1008-dev'] [15:09:33] denisse: yep [15:09:46] err, I just got p.aged again [15:09:50] yeah [15:09:51] !incidents [15:09:51] 4136 (ACKED) kafka-jumbo1001/Kafka Broker Server (paged) [15:09:52] 4137 (ACKED) kafka-jumbo1002/Kafka Broker Server (paged) [15:10:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) [15:10:39] re kafka pages: sorry, my fault. 
I missed setting a silence [15:10:48] it's all good, we stopped the services on purpose [15:11:13] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967232 (https://phabricator.wikimedia.org/T347075) [15:11:13] PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:17] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:11:36] PROBLEM - Kafka Broker Server #page on kafka-jumbo1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:11:51] I'll silence the hosts in icinga [15:12:01] thanks brouberol! [15:12:06] we got one more, so ACKing [15:12:14] sorry again, my fault [15:12:17] (03CR) 10DCausse: rdf-streaming-updater: update staging values (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:12:18] thanks. Are we going to get more? [15:12:25] * Emperor still half-way through making this tea... 
[15:12:35] I guess running up the stairs every time is good exercise [15:13:02] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1001.eqiad.wmnet with reason: host is being decommissioned [15:13:15] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1001.eqiad.wmnet with reason: host is being decommissioned [15:13:18] Emperor: we should be good now (everything silenced) [15:13:25] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1002.eqiad.wmnet with reason: host is being decommissioned [15:13:49] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1002.eqiad.wmnet with reason: host is being decommissioned [15:13:55] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:13:55] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1003.eqiad.wmnet with reason: host is being decommissioned [15:14:19] PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:19] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1003.eqiad.wmnet with reason: host is being decommissioned [15:14:22] PROBLEM - Kafka Broker Server #page on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:14:26] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1004.eqiad.wmnet with reason: host 
is being decommissioned [15:14:39] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1004.eqiad.wmnet with reason: host is being decommissioned [15:14:45] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1005.eqiad.wmnet with reason: host is being decommissioned [15:14:54] one more, ACKed [15:14:58] !incidents [15:14:59] 4136 (ACKED) kafka-jumbo1001/Kafka Broker Server (paged) [15:14:59] 4137 (ACKED) kafka-jumbo1002/Kafka Broker Server (paged) [15:14:59] 4138 (ACKED) kafka-jumbo1003/Kafka Broker Server (paged) [15:14:59] 4139 (RESOLVED) kafka-jumbo1004/Kafka Broker Server (paged) [15:15:10] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1005.eqiad.wmnet with reason: host is being decommissioned [15:15:16] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1006.eqiad.wmnet with reason: host is being decommissioned [15:15:35] alright, I've silenced all 6 hosts in icinga [15:15:41] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1006.eqiad.wmnet with reason: host is being decommissioned [15:16:02] sorry again folks [15:17:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:17:24] brouberol: all good! [15:22:13] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) This is now fixed in esams, solution that's been applied is to add a community on sessions to LVS servers if the MED is... 
[15:24:10] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967232 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [15:24:57] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967232 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [15:25:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED [15:26:52] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10ssingh) Thanks, confirming this is working: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=33 [15:28:06] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) Actually there is a caveat, traffic from other servers on asw1-bw27-esams will still route out via lvs3010, until I impl... 
[15:30:06] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:30:19] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:31:59] (03PS1) 10Fabfur: hiera: enable dual disk storage for new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/967235 [15:32:26] (03CR) 10CI reject: [V: 04-1] hiera: enable dual disk storage for new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/967235 (owner: 10Fabfur) [15:33:51] (03PS2) 10Fabfur: hiera: enable dual disk storage for new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/967235 (https://phabricator.wikimedia.org/T349244) [15:34:34] 10SRE, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [15:34:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:38:37] (03PS1) 10Volans: sre.puppet.sync-netbox-hiera: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967236 (https://phabricator.wikimedia.org/T341973) [15:40:39] 10SRE, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [15:41:09] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@a311c5d]: (no justification provided) [15:42:03] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@a311c5d]: (no justification provided) (duration: 00m 54s) [15:43:11] (03CR) 10Ssingh: hiera: enable dual disk storage for new cp hosts in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967235 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [15:46:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:39] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:00] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) [15:47:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/967236 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:48:12] (03CR) 10Volans: [C: 03+2] sre.puppet.sync-netbox-hiera: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967236 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:49:18] (03PS1) 10Ssingh: hiera: add host override for cp1076 
[puppet] - 10https://gerrit.wikimedia.org/r/967239 (https://phabricator.wikimedia.org/T349244) [15:49:23] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:40] (03CR) 10Herron: profile::mediawiki::common: set default histogram buckets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [15:50:31] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/88/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967239 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh) [15:52:20] (03CR) 10Ssingh: hiera: enable dual disk storage for new cp hosts in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967235 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [15:52:28] (03PS1) 10Brouberol: Drop kafka-jumbo100[1-6].eqiad.wmnet from the puppet site [puppet] - 10https://gerrit.wikimedia.org/r/967240 (https://phabricator.wikimedia.org/T336044) [15:52:38] (03Merged) 10jenkins-bot: sre.puppet.sync-netbox-hiera: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967236 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:53:00] (03CR) 10Ssingh: [V: 03+1 C: 04-2] "Test commit for running PCC: do not merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/967239 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh) [15:56:11] (03CR) 10Herron: [C: 03+2] pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:56:43] (03PS8) 10Herron: pyrra: add logstash requests slo [puppet] - 10https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) [15:57:18] (03CR) 10Cwhite: [V: 03+1 C: 03+1] profile::mediawiki::common: set default histogram buckets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [15:58:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:20] (03CR) 10Herron: [C: 03+2] pyrra: add logstash requests slo [puppet] - 10https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [16:06:12] (03CR) 10Volans: [C: 04-1] "This does not remove the hosts from puppetdb and hence the cumin aliases" [puppet] - 10https://gerrit.wikimedia.org/r/967240 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [16:06:59] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:07:55] (03PS1) 10Kosta Harlan: ipoid: Enable the cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/967243 (https://phabricator.wikimedia.org/T346861) [16:08:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:08:31] (03CR) 10Kosta Harlan: [C: 04-2] "Not ready yet" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967243 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [16:09:16] (03PS10) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [16:11:09] (03PS1) 10Kosta Harlan: ipoid: Set an initialImport cron job [deployment-charts] - 10https://gerrit.wikimedia.org/r/967245 (https://phabricator.wikimedia.org/T346861) [16:12:40] (03PS1) 10Jforrester: wikifunctions: Bump evaluators to noisy-logged ones [deployment-charts] - 10https://gerrit.wikimedia.org/r/967246 (https://phabricator.wikimedia.org/T343829) [16:12:57] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Bump evaluators to noisy-logged ones [deployment-charts] - 
10https://gerrit.wikimedia.org/r/967246 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester) [16:13:46] (03Merged) 10jenkins-bot: wikifunctions: Bump evaluators to noisy-logged ones [deployment-charts] - 10https://gerrit.wikimedia.org/r/967246 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester) [16:14:25] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:15:07] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:15:41] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:16:32] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:16:37] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:17:05] (03PS2) 10Kosta Harlan: [WIP] ipoid: Set an initialImport cron job [deployment-charts] - 10https://gerrit.wikimedia.org/r/967245 (https://phabricator.wikimedia.org/T346861) [16:17:27] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:18:04] (03PS1) 10BryanDavis: striker: Bump container version to 2023-10-19-160227-production [puppet] - 10https://gerrit.wikimedia.org/r/967247 (https://phabricator.wikimedia.org/T348131) [16:18:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:21:41] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 401 bytes in 5.513 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:24:21] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:24:51] (03PS1) 10Jforrester: wikifunctions: Change the staging JS evaluator over to WASM [deployment-charts] - 10https://gerrit.wikimedia.org/r/967249 [16:26:08] (03CR) 10BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/output/967247/90/" [puppet] - 10https://gerrit.wikimedia.org/r/967247 (https://phabricator.wikimedia.org/T348131) (owner: 10BryanDavis) [16:30:50] (03PS2) 10Jforrester: wikifunctions: Change the staging JS evaluator over to WASM [deployment-charts] - 10https://gerrit.wikimedia.org/r/967249 [16:34:31] (03PS3) 10Jforrester: wikifunctions: Change the staging JS evaluator over to WASM instead of a special service [deployment-charts] - 10https://gerrit.wikimedia.org/r/967249 [16:44:39] (03PS3) 10Fabfur: haproxy: remove multiple backends choice [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) [16:44:41] (03PS11) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [16:47:41] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Change the staging JS evaluator over to WASM instead of a special service [deployment-charts] - 10https://gerrit.wikimedia.org/r/967249 (owner: 10Jforrester) [16:48:33] (03Merged) 10jenkins-bot: wikifunctions: Change the staging JS evaluator over to WASM instead of a special service [deployment-charts] - 10https://gerrit.wikimedia.org/r/967249 (owner: 10Jforrester) [16:49:54] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 
10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) Adding a note that the bad files are `/etc/cassandra-a/tls/server.trust` from C:cassandra #line 443 ` lang=puppe... [16:49:59] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:50:52] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:51:02] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) >>! In T328872#9263825, @tstarling wrote: > Since the cause... [16:51:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:51:10] (03PS4) 10Fabfur: haproxy: remove multiple backends choice [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) [16:51:12] (03PS12) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [16:51:48] (03PS1) 10Genoveva Galarza: [WIP][wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967254 [16:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:59:03] (03PS5) 10Fabfur: haproxy: remove multiple backends choice [puppet] - 10https://gerrit.wikimedia.org/r/967173 
(https://phabricator.wikimedia.org/T349287) [16:59:05] (03PS13) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1700) [17:02:24] (03CR) 10Fabfur: haproxy: remove multiple backends choice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur) [17:03:08] (03CR) 10Bking: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:12:14] (03PS8) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [17:15:57] PROBLEM - BGP status on ssw1-a8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Active - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:16:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:16:45] (03PS1) 10Dwisehaupt: Initial checking of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [17:17:11] (03CR) 10CI reject: [V: 04-1] Initial checking of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [17:18:16] (03PS2) 10Dwisehaupt: Initial checking of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [17:18:44] (03CR) 10CI reject: [V: 04-1] Initial checking of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [17:20:04] (03PS3) 10Dwisehaupt: Initial checking of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [17:20:05] RECOVERY - BGP status on ssw1-a8-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:20:30] (03CR) 10CI reject: [V: 04-1] Initial checking of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [17:21:51] (03PS1) 10Mabualruz: Make Vector feature flags compatible with beta features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967259 (https://phabricator.wikimedia.org/T347772) [17:27:39] PROBLEM - cassandra-a service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:28:07] PROBLEM - cassandra-c service on restbase1016 is CRITICAL: CRITICAL - Expecting active but 
unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:28:11] PROBLEM - cassandra-b service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:28:25] ^^^ ignore, I should have had those downtimed [17:28:58] ok! [17:33:50] !log Decommissioning Cassandra, restbase1018-{a,b,c} — T328490 [17:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:54] T328490: restbase cluster: decommission end-of-life hosts - https://phabricator.wikimedia.org/T328490 [17:35:00] (03PS4) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [17:35:27] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [17:39:06] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/96/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur) [17:42:26] (03PS5) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [17:42:53] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [17:45:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:55:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [18:00:05] brennen and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1800). [18:00:32] o/ [18:01:58] (03CR) 10AOkoth: [C: 03+2] vrts: add new required packages v6.3.4 [puppet] - 10https://gerrit.wikimedia.org/r/966279 (https://phabricator.wikimedia.org/T348987) (owner: 10AOkoth) [18:02:10] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967262 (https://phabricator.wikimedia.org/T348354) [18:02:12] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967262 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [18:02:54] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967262 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [18:08:48] (03CR) 10Jdlrobson: [C: 03+1] Make Vector feature flags compatible with beta features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967259 (https://phabricator.wikimedia.org/T347772) (owner: 10Mabualruz) [18:09:16] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.1 refs T348354 [18:09:30] T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354 [18:09:37] 
(LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [18:10:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [18:14:37] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [18:15:10] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [18:17:55] (03PS1) 10Jforrester: wikifunctions: Bump staging image for better logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/967263 [18:20:05] ^ hrm. dead letters did spike around deploy time. unclear to me what that means, though. [18:20:26] (03PS6) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [18:20:34] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Bump staging image for better logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/967263 (owner: 10Jforrester) [18:21:24] (03Merged) 10jenkins-bot: wikifunctions: Bump staging image for better logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/967263 (owner: 10Jforrester) [18:22:10] Hello, is there a deploy happening right now? I was wondering if it's ok to backport a beta-cluster config change outside the backport window? 
(patch in question: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/967259/) [18:22:13] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:22:54] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:22:55] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:23:23] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:07] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:29] jouncebot: now [18:27:29] For the next 1 hour(s) and 32 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1800) [18:28:21] brennen: can you help jan_drewniak with deployment? [18:28:52] Beta only but obviously not stepping on your train [18:33:41] jan_drewniak: Train promotion already happened so you should be good to go [18:34:38] dancy: ok thanks, I'll go ahead with the beta-config change now then. [18:35:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967259 (https://phabricator.wikimedia.org/T347772) (owner: 10Mabualruz) [18:35:45] (03Merged) 10jenkins-bot: Make Vector feature flags compatible with beta features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967259 (https://phabricator.wikimedia.org/T347772) (owner: 10Mabualruz) [18:36:34] ok so beta-config changes *are* automatically synced... 
I wasn't sure about that [18:38:34] jan_drewniak: you still need to scap pull [18:38:57] `scap backport` handles all of the details. [18:39:01] Oh you did scap backport [18:39:06] RhinosF1: right, I ran `scap backport 967259` which did that pull :) [18:39:07] Ye that's fancy enough to be smart [18:39:48] So yes all magic happens, beta will deploy itself [18:39:55] yeah I love it, after the pull it told me: `18:35:57 Skipping sync since all commits were beta/labs-only changes. Operation completed.` [18:40:10] thanks! [18:40:18] Scap backport is an amazing thing [18:51:30] jan_drewniak, RhinosF1: sorry i missed the ping here earlier - been digging around in logstash. [18:51:59] brennen: no problem, dancy confirmed floor was clear [19:01:52] (03PS1) 10Herron: pyrra-filesystem: enable generic recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967273 (https://phabricator.wikimedia.org/T302995) [19:03:35] (03PS1) 10HMonroy: PhonosButton: use text() instead of append() [extensions/Phonos] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967188 (https://phabricator.wikimedia.org/T349312) [19:03:38] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:03:51] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/97/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967273 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:09:30] 10SRE, 10Maps: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (10Platonides) What Referer would be provided by such app? Would the requests from the app have a User-Agent identifying it? Which one? 
[19:10:31] (03CR) 10Herron: [V: 03+1 C: 03+2] pyrra-filesystem: enable generic recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967273 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:11:29] (03PS3) 10Sohom Datta: InitialiseSettings-labs: Set values for renamed PageTriage variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595) (owner: 10MPGuy2824) [19:15:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:16:55] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:39] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/967280 [19:23:01] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/98/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967280 (owner: 10Herron) [19:25:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:27:03] (03PS1) 10Ryan Kemper: elastic: make request latency alerts more granular [puppet] - 10https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) [19:28:25] (03PS2) 10Ryan Kemper: elastic: make request latency alerts more granular [puppet] - 10https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) [19:30:41] (03PS7) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [19:31:03] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/99/console" [puppet] - 10https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) (owner: 10Ryan Kemper) [19:31:44] (03Abandoned) 10Bartosz Dziewoński: DNM: null edit CI test [extensions/DiscussionTools] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966242 (owner: 10C. 
Scott Ananian) [19:32:07] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:07] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:36:36] (03PS2) 10Herron: pyrra::filesystem: remove [] from slo definition [puppet] - 10https://gerrit.wikimedia.org/r/967280 [19:37:15] (03PS3) 10Ryan Kemper: elastic: make request latency alerts more granular [puppet] - 10https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) [19:38:24] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/100/console" [puppet] - 10https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) (owner: 10Ryan Kemper) [19:39:31] (03CR) 10Ebernhardson: [C: 03+1] elastic: make request latency alerts more granular [puppet] - 10https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) (owner: 10Ryan Kemper) [19:39:44] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] elastic: make request latency alerts more granular [puppet] - 10https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) (owner: 10Ryan Kemper) [19:40:06] (03CR) 10Herron: [C: 03+2] pyrra::filesystem: remove [] from slo definition [puppet] - 10https://gerrit.wikimedia.org/r/967280 (owner: 10Herron) [19:40:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:45:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:47:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:50:25] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:56:34] (03PS1) 10Herron: thanos: update reload endpoint to reflect updated web prefix [puppet] - 10https://gerrit.wikimedia.org/r/967291 (https://phabricator.wikimedia.org/T349102) [19:57:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:58:14] (03CR) 10Herron: "I noticed reload started throwing a 404" [puppet] - 10https://gerrit.wikimedia.org/r/967291 (https://phabricator.wikimedia.org/T349102) (owner: 10Herron) [20:00:05] brennen and TheresNoTime: Dear deployers, time to do the UTC late backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T2000). 
[20:02:02] o/ [20:02:08] !log utc late backport window: no patches [20:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:14] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:08:02] (03PS1) 10BCornwall: Add Prometheus metrics for fifo-log-demux [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/967293 [20:11:59] Hey brennen and TheresNoTime - was wondering if I could get a security patch out now (saw there was a backport training session today…) [20:12:18] sbassett: No patches scheduled so should be good to go. [20:13:15] tx, James_F [20:15:27] ^ +1 should be fine, brennen and I staring at logs :) [20:16:38] Ok, I might not have a deploy after all as I need a question answered for T336027. Sorry about that :) [20:16:56] (03PS2) 10BCornwall: Add Prometheus metrics for fifo-log-demux [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/967293 (https://phabricator.wikimedia.org/T345939) [20:17:23] (03PS1) 10Bartosz Dziewoński: Don't remove current wiki family from $wgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967295 [20:19:43] (SystemdUnitFailed) firing: puppet-agent-timer.service Failed on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:20:28] PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:00] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:22] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:35] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye [20:39:42] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye [20:39:52] (03PS1) 10Bartosz Dziewoński: Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 [20:39:54] (03PS1) 10Bartosz Dziewoński: Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967303 [20:40:38] RECOVERY - Check systemd state on apifeatureusage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:54] (03CR) 10Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [20:41:07] (03CR) 10CI reject: [V: 04-1] Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967303 (owner: 10Bartosz Dziewoński) [20:42:18] (03CR) 10Bartosz Dziewoński: "I'm trying to split off some bits of this into separate changes that I wouldn't be very afraid to get deployed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [20:43:57] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Htriedman) Hi @VirginiaPoundstone! Thanks for the detailed questions! I'll try to answer them one by one :) > 1. Who is the audien... [20:44:27] (03PS2) 10Bartosz Dziewoński: Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967303 [20:44:42] (SystemdUnitFailed) resolved: puppet-agent-timer.service Failed on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:47:20] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10aaron) I wonder if the auth token just expired while the combined... [20:48:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:49:14] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10Unstewarded-production-error, 10Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (10aaron) >>! In T341007#9251361, @Beao wrote: > Am I experiencing the same pro... 
[20:52:23] (03PS9) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [20:53:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:55:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye [21:01:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:06:47] RECOVERY - Check systemd state on ms-be2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:42] (03PS10) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [21:11:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:12:49] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye [21:12:55] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye [21:22:32] (03CR) 10HMonroy: [C: 03+2] PhonosButton: use text() instead of append() [extensions/Phonos] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967188 (https://phabricator.wikimedia.org/T349312) (owner: 10HMonroy) [21:23:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hmonroy@deploy2002 using scap backport" [extensions/Phonos] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967188 (https://phabricator.wikimedia.org/T349312) (owner: 10HMonroy) [21:24:19] (03PS11) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [21:25:03] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [21:25:25] (03Merged) 10jenkins-bot: PhonosButton: use text() instead of append() [extensions/Phonos] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967188 (https://phabricator.wikimedia.org/T349312) (owner: 10HMonroy) [21:25:44] !log hmonroy@deploy2002 Started scap: Backport for [[gerrit:967188|PhonosButton: use text() instead of append() (T349312)]] [21:26:43] (03PS12) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 
(https://phabricator.wikimedia.org/T349095) [21:27:00] !log hmonroy@deploy2002 hmonroy: Backport for [[gerrit:967188|PhonosButton: use text() instead of append() (T349312)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:27:15] !log hmonroy@deploy2002 hmonroy: Continuing with sync [21:27:27] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:28] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [21:28:31] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye [21:29:05] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2045 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:31:51] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2044 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:32:33] !log hmonroy@deploy2002 Finished scap: Backport for [[gerrit:967188|PhonosButton: use text() instead of append() (T349312)]] (duration: 06m 48s) [21:41:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:42:35] (03CR) 10JHathaway: "Is there any way to review what code changed? Why is the old code not being deleted? 
I don't really have any context on the function, that" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [21:44:36] (03CR) 10JHathaway: compile_redirects: port compile_redirects to new API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [21:49:55] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:50:23] PROBLEM - BFD status on cr1-esams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:51:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:53:13] (03PS13) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [21:53:58] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [22:04:39] (03PS14) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [22:08:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [22:24:53] RECOVERY - BFD status on cr1-esams is OK: UP: 5 AdminDown: 3 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:25:49] RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:28:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [22:28:19] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:28:25] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:36:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED [22:37:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED [22:38:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [22:48:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [23:00:17] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:00:21] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:38] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:51:11] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [23:56:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [23:58:14] (03CR) 10Krinkle: [C: 03+1] "LGTM. Feel free to deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966945 (https://phabricator.wikimedia.org/T47514) (owner: 10Tim Starling)