[00:22:07] <icinga-wm>	 PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:06] <wikibugs>	 (CR) Cwhite: "Looking at the full diff, there appears to be a set of quantiles configured as well.  Probably don't need those." [puppet] - https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: Herron)
[00:29:53] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/966881 (owner: Herron)
[00:30:13] <wikibugs>	 (CR) Cwhite: [C: +1] thanos: allow thanos-rule to serve /rule [puppet] - https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102) (owner: Filippo Giunchedi)
[00:30:45] <wikibugs>	 (CR) Cwhite: [C: +1] thanos: reverse-proxy /rule to rule-hosts [puppet] - https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102) (owner: Filippo Giunchedi)
[00:33:03] <wikibugs>	 (CR) Cwhite: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[00:33:07] <wikibugs>	 (CR) Cwhite: [C: +1] pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[00:33:20] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (tstarling) Since the cause is not the same old failure of the prox...
[00:39:00] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966829
[00:39:06] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966829 (owner: TrainBranchBot)
[00:53:43] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966829 (owner: TrainBranchBot)
[00:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:57:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[01:07:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[01:17:15] <icinga-wm>	 RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[01:46:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[01:57:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[02:04:23] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:05:59] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:06:23] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:07:37] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:08:29] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:08:39] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:12:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[02:30:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:38:37] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:37] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:51:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[03:56:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[04:14:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[04:19:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[04:20:25] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:20:31] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:12:12] <wikibugs>	 (CR) Santhosh: Update cxserver to 2023-10-12-080927-production (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: KartikMistry)
[05:12:57] <logmsgbot>	 !log tchin@deploy2002 Started deploy [airflow-dags/analytics@60950f6]: Deploying airflow [data-engineering/airflow-dags@60950f6b]
[05:14:01] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[05:14:07] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[05:14:09] <logmsgbot>	 !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@60950f6]: Deploying airflow [data-engineering/airflow-dags@60950f6b] (duration: 01m 12s)
[05:21:07] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[05:22:37] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[05:24:41] <wikibugs>	 (Abandoned) Tim Starling: sshd: Disable keyboard-interactive authentication [puppet] - https://gerrit.wikimedia.org/r/956983 (owner: Tim Starling)
[05:35:08] <wikibugs>	 (PS3) Tim Starling: Add LoginNotify cron job [puppet] - https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989)
[05:36:19] <wikibugs>	 (PS2) Tim Starling: Enable LoginNotify seen subnets table [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989)
[05:42:32] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1231 [puppet] - https://gerrit.wikimedia.org/r/966943
[05:43:08] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1231 [puppet] - https://gerrit.wikimedia.org/r/966943 (owner: Marostegui)
[05:48:55] <wikibugs>	 (CR) Tim Starling: "I went for an 80 day expiry with 8 day buckets after reviewing https://foundation.wikimedia.org/wiki/Legal:Data_retention_guidelines . May" [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[05:52:24] <wikibugs>	 (PS1) Tim Starling: Enable source maps everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/966945 (https://phabricator.wikimedia.org/T47514)
[05:58:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T0600)
[06:00:06] <jouncebot>	 kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T0600).
[06:03:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[06:18:23] <wikibugs>	 (CR) Volans: [C: +2] documentation: expand distributed locking docs [software/spicerack] - https://gerrit.wikimedia.org/r/966886 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[06:18:51] <wikibugs>	 (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[06:19:17] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:19:23] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:25:01] <wikibugs>	 (Merged) jenkins-bot: documentation: expand distributed locking docs [software/spicerack] - https://gerrit.wikimedia.org/r/966886 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[06:28:01] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:28:02] <wikibugs>	 (CR) Volans: [C: +2] "Thanks John for fixing the tests for me!" [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[06:29:23] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:30:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:31:16] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet
[06:31:28] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet
[06:32:29] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet
[06:32:57] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet
[06:34:45] <volans>	 !log enabled distributed locking support in spicerack/cookbooks T341973
[06:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:49] <stashbot>	 T341973: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973
[06:38:34] <wikibugs>	 (CR) Slyngshede: [C: +2] puppet_agent_failed: label alert with appropriate team. [alerts] - https://gerrit.wikimedia.org/r/966821 (owner: Slyngshede)
[06:40:21] <wikibugs>	 (Merged) jenkins-bot: puppet_agent_failed: label alert with appropriate team. [alerts] - https://gerrit.wikimedia.org/r/966821 (owner: Slyngshede)
[06:43:32] <wikibugs>	 (PS9) Brouberol: Publish metrics reflecting skein certificate expiry [puppet] - https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398)
[06:43:39] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (Volans) Distributed locking is now live in Spicerack and used by the Cookbooks. For a general overview see https://doc.wikimedia.org/spicerack/master/introduction.h...
[06:43:50] <wikibugs>	 (CR) Brouberol: Publish metrics reflecting skein certificate expiry (3 comments) [puppet] - https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[06:49:09] <wikibugs>	 (CR) Elukey: ml-services: deploy nllb in llm namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[06:49:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:50:11] <wikibugs>	 (CR) Elukey: ml-services: deploy nllb in llm namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[06:51:11] <wikibugs>	 (CR) Elukey: [C: +1] Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: Brouberol)
[06:52:18] <wikibugs>	 (CR) Elukey: "Revoking the +1, I am trying to check one thing" [puppet] - https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: Brouberol)
[06:56:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] thanos: allow thanos-rule to serve /rule [puppet] - https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102) (owner: Filippo Giunchedi)
[06:56:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] thanos: reverse-proxy /rule to rule-hosts [puppet] - https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102) (owner: Filippo Giunchedi)
[06:57:52] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[06:58:09] <wikibugs>	 (CR) Elukey: "So I checked https://puppet-compiler.wmflabs.org/output/966497/2522/kafka-jumbo1010.eqiad.wmnet/index.html and this will happen to all the" [puppet] - https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: Brouberol)
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T0700).
[07:00:06] <jouncebot>	 tgr: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:20] <apergos>	 out with covid, cannot manage the window
[07:01:31] <elukey>	 ouch take care apergos!
[07:01:38] <apergos>	 ty
[07:01:51] <wikibugs>	 (PS1) Slyngshede: puppet-agent: mention instance in summary [alerts] - https://gerrit.wikimedia.org/r/967128
[07:03:37] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:03:39] <RhinosF1>	 Get well soon apergos
[07:06:39] <wikibugs>	 (PS1) Filippo Giunchedi: thanos: set external-prefix for rule [puppet] - https://gerrit.wikimedia.org/r/967129 (https://phabricator.wikimedia.org/T349102)
[07:07:38] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] thanos: set external-prefix for rule [puppet] - https://gerrit.wikimedia.org/r/967129 (https://phabricator.wikimedia.org/T349102) (owner: Filippo Giunchedi)
[07:08:01] <wikibugs>	 (CR) Slyngshede: "I feel like this would be handy, but let me know if there's a reason as to why we shouldn't include labels in summaries." [alerts] - https://gerrit.wikimedia.org/r/967128 (owner: Slyngshede)
[07:13:36] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[07:14:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[07:14:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/965879 (https://phabricator.wikimedia.org/T342475) (owner: Gergő Tisza)
[07:15:07] <wikibugs>	 (Merged) jenkins-bot: [beta] Make temp user config SUL-friendly [mediawiki-config] - https://gerrit.wikimedia.org/r/965879 (https://phabricator.wikimedia.org/T342475) (owner: Gergő Tisza)
[07:16:17] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[07:16:30] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, also something to keep in mind that when alerts get aggregated in groups the number of alerts will show up on irc, though only one i" [alerts] - https://gerrit.wikimedia.org/r/967128 (owner: Slyngshede)
[07:17:25] <tgr>	 !log UTC morning deploys done
[07:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:03] <wikibugs>	 (PS8) Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. [puppet] - https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312)
[07:20:26] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db2109.codfw.wmnet with reason: db2109 downtime while repooling
[07:20:28] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db2109.codfw.wmnet with reason: db2109 downtime while repooling
[07:21:11] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/60/cons" [puppet] - https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: Slyngshede)
[07:23:39] <wikibugs>	 (CR) Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. (4 comments) [puppet] - https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: Slyngshede)
[07:33:56] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[07:41:31] <wikibugs>	 (CR) Brouberol: [C: +2] Publish metrics reflecting skein certificate expiry [puppet] - https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[07:43:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "I don't think we should be spending time on graphite, having said that feel free to try" [puppet] - https://gerrit.wikimedia.org/r/966881 (owner: Herron)
[07:43:27] <wikibugs>	 (PS2) Ilias Sarantopoulos: ml-services: deploy nllb in llm namespace [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163)
[07:43:44] <wikibugs>	 (CR) Ilias Sarantopoulos: ml-services: deploy nllb in llm namespace (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[07:45:06] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: deploy nllb in llm namespace [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[07:47:23] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:33] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:54:02] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] P:monitoring absent Incinga check_eth. [puppet] - https://gerrit.wikimedia.org/r/966535 (https://phabricator.wikimedia.org/T332764) (owner: Slyngshede)
[07:56:40] <wikibugs>	 (CR) Vgutierrez: [C: +2] hieradata: Deploy digicert-2023 unified cert [puppet] - https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez)
[07:59:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] pyrra: add logstash requests slo [puppet] - https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[08:00:05] <jouncebot>	 brennen and hashar: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T0800).
[08:00:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[08:06:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:07:29] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:10:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:12:17] <icinga-wm>	 PROBLEM - Check systemd state on db2132 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-mysqld-exporter.service,wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:21:56] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: deploy nllb in llm namespace [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[08:22:48] <wikibugs>	 (Merged) jenkins-bot: ml-services: deploy nllb in llm namespace [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[08:28:51] <icinga-wm>	 PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:33:34] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:34:04] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:35:23] <wikibugs>	 (CR) Ayounsi: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (12 comments) [homer/public] - https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: Cathal Mooney)
[08:36:11] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:41:20] <wikibugs>	 (CR) Effie Mouzeli: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: Kosta Harlan)
[08:45:12] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:46:15] <wikibugs>	 (CR) Kosta Harlan: [C: -2] [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: Kosta Harlan)
[08:46:32] <wikibugs>	 (PS1) Elukey: profile::prometheus::k8s: drop unused Istio labels [puppet] - https://gerrit.wikimedia.org/r/967140 (https://phabricator.wikimedia.org/T349072)
[08:51:09] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/71/cons" [puppet] - https://gerrit.wikimedia.org/r/967140 (https://phabricator.wikimedia.org/T349072) (owner: Elukey)
[08:51:10] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:58:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:02:34] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (10Nicolas_Raoul)
[09:03:13] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:29] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] [BETA HACK] Attempt to secure Puppet DB better (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941476 (owner: 10Krinkle)
[09:05:42] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:09:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede)
[09:09:54] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712)
[09:11:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[09:12:25] <icinga-wm>	 RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:13:07] <wikibugs>	 (03CR) 10Jbond: "lgtm nit/question inline" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[09:14:03] <wikibugs>	 (03PS2) 10Filippo Giunchedi: sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712)
[09:14:05] <wikibugs>	 (03PS1) 10Filippo Giunchedi: test: expand 'runbook not found' assertion [alerts] - 10https://gerrit.wikimedia.org/r/967144
[09:15:09] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:15:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] test: expand 'runbook not found' assertion [alerts] - 10https://gerrit.wikimedia.org/r/967144 (owner: 10Filippo Giunchedi)
[09:15:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[09:18:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:20:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:20:57] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: update rec-api-ng resource limits to match wmflabs [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475)
[09:30:51] <wikibugs>	 (03CR) 10Fabfur: [C: 04-1] haproxy: enable healthcheck-dedicated backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur)
[09:32:08] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] haproxy: enable healthcheck-dedicated backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur)
[09:32:18] <wikibugs>	 (03PS1) 10Jelto: miscweb: remove the use of :latest image tag in httpd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856)
[09:32:48] <wikibugs>	 (03PS6) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[09:35:13] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/966221/comments/9b08a162_1d01cf19 is still pending" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur)
[09:37:56] <wikibugs>	 (03CR) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from the inventory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol)
[09:38:21] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 04-1] profile::prometheus::k8s: drop unused Istio labels [puppet] - 10https://gerrit.wikimedia.org/r/967140 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey)
[09:40:26] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis)
[09:43:21] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] Create a new role for analytics_cluster::mariadb and assign it [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis)
[09:49:12] <wikibugs>	 (03CR) 10Effie Mouzeli: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan)
[09:55:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[10:00:05] <jouncebot>	 mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1000).
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1000)
[10:00:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[10:00:41] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:25] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:02:03] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:34] <wikibugs>	 (03PS1) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967148 (https://phabricator.wikimedia.org/T236373)
[10:07:36] <wikibugs>	 (03PS1) 10Jbond: puppet: simplify debug code [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967149
[10:07:38] <wikibugs>	 (03PS1) 10Jbond: tox: remove envdir optimizations [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967150 (https://phabricator.wikimedia.org/T348434)
[10:07:40] <wikibugs>	 (03PS1) 10Jbond: tox: add commands to allowlist_externals [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967151
[10:07:42] <wikibugs>	 (03PS1) 10Jbond: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967152
[10:07:44] <wikibugs>	 (03PS1) 10Jbond: debug_presentation: script to render HTML templates [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967153
[10:07:46] <wikibugs>	 (03PS1) 10Jbond: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967154
[10:07:48] <wikibugs>	 (03PS1) 10Jbond: Add style to HTML output [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967155
[10:11:47] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:13:06] <wikibugs>	 (03PS2) 10Jbond: puppet: simplify debug code [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967149
[10:13:08] <wikibugs>	 (03PS2) 10Jbond: tox: remove envdir optimizations [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967150 (https://phabricator.wikimedia.org/T348434)
[10:13:10] <wikibugs>	 (03PS2) 10Jbond: tox: add commands to allowlist_externals [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967151
[10:13:12] <wikibugs>	 (03PS2) 10Jbond: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967152
[10:13:14] <wikibugs>	 (03PS2) 10Jbond: debug_presentation: script to render HTML templates [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967153
[10:13:16] <wikibugs>	 (03PS2) 10Jbond: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967154
[10:13:18] <wikibugs>	 (03PS2) 10Jbond: Add style to HTML output [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967155
[10:16:57] <wikibugs>	 (03PS9) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312)
[10:18:21] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/80/cons" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[10:19:07] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: fix nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/967156 (https://phabricator.wikimedia.org/T349163)
[10:20:37] <wikibugs>	 (03CR) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[10:23:10] <wikibugs>	 (03PS1) 10Jbond: html: update html to wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157
[10:25:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[10:26:27] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ml-services: fix nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/967156 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos)
[10:27:16] <wikibugs>	 (03PS2) 10Jbond: html: update html to wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157
[10:33:54] <wikibugs>	 (03CR) 10Elukey: ml-services: update rec-api-ng resource limits to match wmflabs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[10:41:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[10:43:33] <wikibugs>	 (03PS3) 10Jbond: Add style to HTML output [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967155
[10:43:35] <wikibugs>	 (03PS3) 10Jbond: html: update html to wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157
[10:51:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet: simplify debug code [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967149 (owner: 10Jbond)
[10:51:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] tox: remove envdir optimizations [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967150 (https://phabricator.wikimedia.org/T348434) (owner: 10Jbond)
[10:51:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] tox: add commands to allowlist_externals [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967151 (owner: 10Jbond)
[10:51:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967152 (owner: 10Jbond)
[10:51:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] debug_presentation: script to render HTML templates [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967153 (owner: 10Jbond)
[10:51:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Use macros for links to Gerrit and Jenkins [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967154 (owner: 10Jbond)
[10:51:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Add style to HTML output [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967155 (owner: 10Jbond)
[10:51:33] <wikibugs>	 (03CR) 10Kevin Bazira: ml-services: update rec-api-ng resource limits to match wmflabs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[10:51:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] html: update html to wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157 (owner: 10Jbond)
[10:53:14] <wikibugs>	 (03Merged) 10jenkins-bot: puppet: simplify debug code [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967149 (owner: 10Jbond)
[10:54:46] <wikibugs>	 (03Merged) 10jenkins-bot: tox: remove envdir optimizations [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967150 (https://phabricator.wikimedia.org/T348434) (owner: 10Jbond)
[10:54:48] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-2] [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan)
[10:54:51] <wikibugs>	 (03Abandoned) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan)
[10:54:54] <wikibugs>	 (03PS1) 10Jbond: 2.6.0: prepare release [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967162
[10:54:58] <wikibugs>	 (03Merged) 10jenkins-bot: tox: add commands to allowlist_externals [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967151 (owner: 10Jbond)
[10:55:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] 2.6.0: prepare release [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967162 (owner: 10Jbond)
[10:55:33] <wikibugs>	 (03Merged) 10jenkins-bot: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967152 (owner: 10Jbond)
[10:55:35] <wikibugs>	 (03Merged) 10jenkins-bot: debug_presentation: script to render HTML templates [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967153 (owner: 10Jbond)
[10:55:37] <wikibugs>	 (03Merged) 10jenkins-bot: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967154 (owner: 10Jbond)
[10:55:39] <wikibugs>	 (03Merged) 10jenkins-bot: Add style to HTML output [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967155 (owner: 10Jbond)
[10:55:41] <wikibugs>	 (03Merged) 10jenkins-bot: html: update html to wrap parameters [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967157 (owner: 10Jbond)
[10:56:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[10:58:49] <wikibugs>	 (03Merged) 10jenkins-bot: 2.6.0: prepare release [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967162 (owner: 10Jbond)
[11:01:36] <wikibugs>	 (03PS1) 10Jbond: Merge branch '2.x' [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967163
[11:02:35] <wikibugs>	 (03PS1) 10Volans: sre.deploy.python-code: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967164 (https://phabricator.wikimedia.org/T341973)
[11:02:37] <wikibugs>	 (03PS1) 10Volans: sre.discovery.datacenter: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973)
[11:02:39] <wikibugs>	 (03PS1) 10Volans: sre.discovery.service-route: customize lock args [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973)
[11:02:41] <wikibugs>	 (03PS1) 10Volans: sre.dns.netbox: make the lock exclusive [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973)
[11:03:37] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:04:01] <wikibugs>	 (03CR) 10Jbond: "I ended up pointing these at the 2.x branch in" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879 (owner: 10Hashar)
[11:04:19] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:27] <wikibugs>	 (03Abandoned) 10Jbond: tox: remove envdir optimizations [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434) (owner: 10Hashar)
[11:04:35] <wikibugs>	 (03Abandoned) 10Jbond: tox: add commands to allowlist_externals [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966895 (owner: 10Hashar)
[11:04:41] <wikibugs>	 (03Abandoned) 10Jbond: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966822 (owner: 10Hashar)
[11:04:47] <wikibugs>	 (03Abandoned) 10Jbond: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[11:04:54] <wikibugs>	 (03Abandoned) 10Jbond: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966878 (owner: 10Hashar)
[11:04:59] <wikibugs>	 (03Abandoned) 10Jbond: Add style to HTML output [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879 (owner: 10Hashar)
[11:05:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Merge branch '2.x' [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967163 (owner: 10Jbond)
[11:08:10] <wikibugs>	 (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/967170
[11:08:19] <wikibugs>	 (03CR) 10Volans: sre.discovery.datacenter: customize lock arguments (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:08:40] <wikibugs>	 (03Merged) 10jenkins-bot: Merge branch '2.x' [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967163 (owner: 10Jbond)
[11:08:47] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "This is just a proposal" [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:09:15] <wikibugs>	 (03CR) 10Volans: "LMK if you think this is too strict" [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:09:24] <wikibugs>	 10SRE, 10Traffic: HAProxy should use a single backend for Varnish - https://phabricator.wikimedia.org/T349287 (10Fabfur)
[11:09:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/967170 (owner: 10Jbond)
[11:10:22] <wikibugs>	 (03PS2) 10Filippo Giunchedi: test: deal with private runbooks [alerts] - 10https://gerrit.wikimedia.org/r/967144
[11:10:24] <wikibugs>	 (03PS3) 10Filippo Giunchedi: sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712)
[11:11:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] test: deal with private runbooks [alerts] - 10https://gerrit.wikimedia.org/r/967144 (owner: 10Filippo Giunchedi)
[11:12:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[11:13:07] <wikibugs>	 (03PS7) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[11:13:09] <wikibugs>	 (03PS1) 10Fabfur: haproxy: remove multiple backends choice [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287)
[11:13:26] <wikibugs>	 (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond)
[11:15:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:44] <wikibugs>	 (03PS3) 10Filippo Giunchedi: test: deal with private runbooks [alerts] - 10https://gerrit.wikimedia.org/r/967144
[11:15:46] <wikibugs>	 (03PS4) 10Filippo Giunchedi: sre: first iteration for otel-coll alerts [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712)
[11:15:49] <wikibugs>	 (03PS8) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[11:19:38] <wikibugs>	 (03PS1) 10Jelto: kubernetes::deployment_server: add common_image for httpd exporter [puppet] - 10https://gerrit.wikimedia.org/r/967174 (https://phabricator.wikimedia.org/T348856)
[11:19:49] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM.  Nice one :)" [alerts] - 10https://gerrit.wikimedia.org/r/967144 (owner: 10Filippo Giunchedi)
[11:21:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond)
[11:23:55] <icinga-wm>	 RECOVERY - Check systemd state on db2132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:11] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: fix nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/967156 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos)
[11:25:04] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: fix nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/967156 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos)
[11:25:24] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "looking good" [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur)
[11:25:44] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond)
[11:29:19] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "My suggestion is to limit the number of results that rec api is processing. Instead of 500 we can fetch 250/200/100 results/candidates as " [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[11:30:07] <wikibugs>	 (03PS2) 10Fabfur: haproxy: remove multiple backends choice [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287)
[11:30:09] <wikibugs>	 (03PS9) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[11:30:11] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 28): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/82/console" [puppet] - 10https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[11:30:50] <wikibugs>	 (03CR) 10Fabfur: haproxy: remove multiple backends choice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur)
[11:30:54] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[11:31:13] <wikibugs>	 (03CR) 10Fabfur: haproxy: enable healthcheck-dedicated backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur)
[11:32:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/967164 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:32:59] <wikibugs>	 (03PS2) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601)
[11:33:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm, left the open questions to service ops but FTR 1 seems like the correct concurrency to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:35:59] <wikibugs>	 (03CR) 10Jbond: sre.discovery.service-route: customize lock args (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:36:01] <wikibugs>	 (03CR) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (0312 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney)
[11:36:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:38:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[11:40:24] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] sre.dns.netbox: make the lock exclusive [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:43:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[11:44:59] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@6f09297] (releasing): (no justification provided)
[11:46:07] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@6f09297] (releasing): (no justification provided) (duration: 01m 08s)
[11:47:28] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[11:52:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond)
[11:53:02] <wikibugs>	 (03PS1) 10Jbond: elasticsearch::relforge: remove trailing comma [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291)
[11:53:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[11:53:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond)
[11:54:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): re-write compile_redirects function - https://phabricator.wikimedia.org/T348883 (10jbond)
[11:54:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond)
[11:54:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[11:54:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond)
[11:54:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): re-write compile_redirects function - https://phabricator.wikimedia.org/T348883 (10jbond) 05Open→03Stalled p:05Triage→03Medium
[11:54:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[11:54:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/83/console" [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond)
[11:55:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) 05Open→03In progress p:05Triage→03Medium
[11:55:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond)
[12:00:07] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1200)
[12:06:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:17:45] <wikibugs>	 (03PS19) 10Brouberol: Define environment variables to ease the use of prometheus-metricsfetcher [puppet] - 10https://gerrit.wikimedia.org/r/967134
[12:20:00] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Remove kafka-jumbo100[1-6] brokers from the inventory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol)
[12:20:33] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol)
[12:23:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[12:24:37] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.deploy.python-code: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967164 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[12:25:00] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.dns.netbox: make the lock exclusive [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[12:27:26] <wikibugs>	 (03CR) 10Ayounsi: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney)
[12:27:31] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2015 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[12:28:36] <wikibugs>	 (03Merged) 10jenkins-bot: sre.deploy.python-code: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967164 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[12:28:51] <wikibugs>	 (03PS2) 10Volans: sre.dns.netbox: make the lock exclusive [cookbooks] - 10https://gerrit.wikimedia.org/r/967167 (https://phabricator.wikimedia.org/T341973)
[12:29:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:30:32] <wikibugs>	 (03CR) 10Kevin Bazira: ml-services: update rec-api-ng resource limits to match wmflabs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[12:33:06] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Update comment about EditAttemptStep instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208
[12:34:00] <wikibugs>	 (03PS5) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565)
[12:34:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:34:04] <wikibugs>	 (03PS5) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565)
[12:34:13] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:55] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "@Phuedx You added the comment in I070d826f63dae9e882137fd3d9bb3a76f6622a50. To be honest, I don't really understand it – the values listed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 (owner: 10Bartosz Dziewoński)
[12:37:07] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:39:18] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:40:13] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[12:41:19] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1074 days) https://wikitech.wikimedia.org/wiki/Logs
[12:42:05] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:42:13] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:44:19] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:44:35] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2015 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:44:55] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.729 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:45:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:46:05] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:50:39] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[12:50:39] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[12:50:45] <logmsgbot>	 !log volans@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[12:50:46] <logmsgbot>	 !log volans@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[12:50:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:52:06] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[12:52:55] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2045 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:55:41] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2044 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:56:09] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2044 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:59:25] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: noop - volans@cumin1001"
[12:59:37] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/84/console" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[12:59:55] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:prometheus::ethtool_export Add ethtool exporter. [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1300)
[13:00:05] <jouncebot>	 dcausse and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:15] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: noop - volans@cumin1001"
[13:00:15] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:00:28] <dcausse>	 o/
[13:00:42] <aanzx>	 o/
[13:02:35] <wikibugs>	 (03Abandoned) 10Jforrester: jquery.tablesorter: Fix data-sort-type with numeric values [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965690 (https://phabricator.wikimedia.org/T348812) (owner: 10Jforrester)
[13:03:52] * TheresNoTime is unable to deploy
[13:05:23] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[13:13:27] <wikibugs>	 (03PS2) 10Anzx: hiwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967213 (https://phabricator.wikimedia.org/T310961)
[13:14:56] <wikibugs>	 10SRE-OnFire, 10Cloud-VPS, 10cloud-services-team, 10Sustainability (Incident Followup), 10User-dcaro: openstack: create a cookbook to inject commands to VMs via console at scale - https://phabricator.wikimedia.org/T347683 (10taavi) a:03taavi
[13:15:11] <wikibugs>	 (03PS1) 10Hashar: Add a json representation of the build [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214
[13:16:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[13:17:09] <wikibugs>	 (03CR) 10Hashar: "The build.json is the counterpart of the build index. Theoretically I can then integrate those data into the Gerrit check tab so people can" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214 (owner: 10Hashar)
[13:18:11] <wikibugs>	 (03CR) 10Hashar: "I should ideally add tests to cover `presentation.json.Build`" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214 (owner: 10Hashar)
[13:21:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[13:22:47] <wikibugs>	 (03CR) 10Ottomata: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[13:23:34] <wikibugs>	 (03PS6) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565)
[13:23:50] <wikibugs>	 (03PS6) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565)
[13:24:07] <wikibugs>	 (03CR) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[13:24:09] <wikibugs>	 (03PS1) 10Btullis: Update the email address used for refine-test systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/967215
[13:24:11] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede)
[13:25:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede)
[13:25:38] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] puppet-agent-fail: enable check for all clusters. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede)
[13:25:48] <wikibugs>	 (03PS2) 10Btullis: Update the email address used for refine-test systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/967215
[13:27:16] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/85/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967215 (owner: 10Btullis)
[13:30:56] <WMDE-Fisch>	 I could deploy \o @ dcausse & aanzx although I've gotten a bit rusty at it
[13:30:57] <wikibugs>	 (03CR) 10Jbond: "this looks fine, minor nit in line.  could you also target 2.x and update the changelog" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214 (owner: 10Hashar)
[13:31:29] <WMDE-Fisch>	 @ aanzx I'm missing a +1 from another reviewer on your patches though
[13:31:31] <dcausse>	 WMDE-Fisch: thanks! I have a meeting in 5min so it's fine to skip mine
[13:31:37] <WMDE-Fisch>	 kk
[13:34:47] <wikibugs>	 (03PS3) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601)
[13:34:53] <wikibugs>	 (03PS3) 10Btullis: Update the email address used for refine-test systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/967215
[13:35:18] <wikibugs>	 (03CR) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney)
[13:35:30] <wikibugs>	 (03CR) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney)
[13:35:32] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Cleanup Kartographer Nearby flags (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: 10WMDE-Fisch)
[13:35:33] <WMDE-Fisch>	 So I guess I'll skip aanzx's patches as well. ... Only my patch left then.
[13:35:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] test: deal with private runbooks [alerts] - 10https://gerrit.wikimedia.org/r/967144 (owner: 10Filippo Giunchedi)
[13:36:13] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/86/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967215 (owner: 10Btullis)
[13:36:30] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Awesome!" [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney)
[13:36:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by wmde-fisch@deploy2002 using scap backport" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966610 (https://phabricator.wikimedia.org/T252346) (owner: 10WMDE-Fisch)
[13:36:45] <wikibugs>	 (03PS4) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601)
[13:37:11] <wikibugs>	 (03PS2) 10Slyngshede: puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128
[13:40:33] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Workaround to center search terms label"" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966610 (https://phabricator.wikimedia.org/T252346) (owner: 10WMDE-Fisch)
[13:41:00] <logmsgbot>	 !log wmde-fisch@deploy2002 Started scap: Backport for [[gerrit:966610|Revert "Revert "Workaround to center search terms label"" (T252346)]]
[13:41:05] <stashbot>	 T252346: AdvancedSearch namespace pillbox label is misaligned - https://phabricator.wikimedia.org/T252346
[13:42:32] <logmsgbot>	 !log wmde-fisch@deploy2002 wmde-fisch: Backport for [[gerrit:966610|Revert "Revert "Workaround to center search terms label"" (T252346)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:43:15] <logmsgbot>	 !log wmde-fisch@deploy2002 wmde-fisch: Continuing with sync
[13:43:43] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: 10Cathal Mooney)
[13:46:36] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966833 (https://phabricator.wikimedia.org/T347475)
[13:47:45] <kostajh>	 WMDE-Fisch: hi, can I add one for beta labs?
[13:47:56] <WMDE-Fisch>	 Sure
[13:48:00] <WMDE-Fisch>	 kostajh: 
[13:48:03] <kostajh>	 WMDE-Fisch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/965100
[13:48:07] <kostajh>	 I'll add to the calendar now
[13:48:26] <wikibugs>	 (03PS3) 10WMDE-Fisch: labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) (owner: 10Kosta Harlan)
[13:48:50] <logmsgbot>	 !log wmde-fisch@deploy2002 Finished scap: Backport for [[gerrit:966610|Revert "Revert "Workaround to center search terms label"" (T252346)]] (duration: 07m 50s)
[13:48:55] <stashbot>	 T252346: AdvancedSearch namespace pillbox label is misaligned - https://phabricator.wikimedia.org/T252346
[13:49:23] <kostajh>	 WMDE-Fisch: Added
[13:50:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by wmde-fisch@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) (owner: 10Kosta Harlan)
[13:51:24] <wikibugs>	 (03Merged) 10jenkins-bot: labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) (owner: 10Kosta Harlan)
[13:52:06] <wikibugs>	 (03PS1) 10Majavah: P:openstack: nova: add script to run console commands [puppet] - 10https://gerrit.wikimedia.org/r/967219 (https://phabricator.wikimedia.org/T347683)
[13:52:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[13:52:19] <WMDE-Fisch>	 kostajh: Done, should be synced and working.
[13:52:29] <kostajh>	 danke
[13:52:42] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] haproxy: remove multiple backends choice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur)
[13:52:46] <WMDE-Fisch>	 gerne :-)
[13:53:13] <kostajh>	 I think it will need a few more minutes to sync to beta cluster https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ 
[13:53:50] <WMDE-Fisch>	 Ah right. 
[13:54:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:openstack: nova: add script to run console commands [puppet] - 10https://gerrit.wikimedia.org/r/967219 (https://phabricator.wikimedia.org/T347683) (owner: 10Majavah)
[13:55:15] <wikibugs>	 (03PS2) 10Majavah: P:openstack: nova: add script to run console commands [puppet] - 10https://gerrit.wikimedia.org/r/967219 (https://phabricator.wikimedia.org/T347683)
[13:56:48] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "you will like this one ;)" [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond)
[13:57:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[13:57:15] <wikibugs>	 (03CR) 10Slyngshede: puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede)
[13:58:28] <wikibugs>	 (03Merged) 10jenkins-bot: puppet-agent: mention instance in summary [alerts] - 10https://gerrit.wikimedia.org/r/967128 (owner: 10Slyngshede)
[13:58:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:58:37] <kostajh>	 WMDE-Fisch: and yeah, it works now that the job has run. Thanks!
[13:58:53] <WMDE-Fisch>	 Perfect. I'm out of here \o :-)
[13:59:43] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Implement JavaScript password match check. [software/bitu] - 10https://gerrit.wikimedia.org/r/966123 (owner: 10Slyngshede)
[14:00:29] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - jclark@cumin1001"
[14:01:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - jclark@cumin1001"
[14:01:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:03:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:03:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:03:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:05:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1007-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:09:43] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:11:16] <wikibugs>	 (03Abandoned) 10Slyngshede: Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede)
[14:12:28] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:12:34] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:14:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:14:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:15:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[14:15:39] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2015 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:15:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "i thought i had updated the commit message.  however this one is caused because" [puppet] - 10https://gerrit.wikimedia.org/r/967185 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond)
[14:16:42] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:17:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:17:27] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:17:35] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:21:33] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:24:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966833 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[14:28:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:29:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:29:50] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Update the email address used for refine-test systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/967215 (owner: 10Btullis)
[14:31:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:31:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:31:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:32:08] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review. :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966833 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[14:32:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:33:13] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966833 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[14:34:22] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet1007-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:34:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:35:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[14:35:35] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[14:38:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:38:38] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:49] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:39:09] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] kubernetes::deployment_server: add common_image for httpd exporter [puppet] - 10https://gerrit.wikimedia.org/r/967174 (https://phabricator.wikimedia.org/T348856) (owner: 10Jelto)
[14:39:25] <vgutierrez>	 hmmm anybody working on thanos?
[14:39:41] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:39:47] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:39:48] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:39:51] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:40:05] <icinga-wm>	 PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:43:21] <godog>	 I think elukey is aware, titan1001 is indeed not in great shape
[14:43:38] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:44:06] <elukey>	 I am yes, rebooting titan1001 vgutierrez 
[14:44:17] <vgutierrez>	 ack
[14:44:41] <elukey>	 !log powercycle titan1001
[14:44:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:59] <wikibugs>	 (03CR) 10Kevin Bazira: ml-services: update rec-api-ng resource limits to match wmflabs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[14:49:11] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:49:17] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:49:27] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:49:33] <icinga-wm>	 RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Fri 03 Nov 2023 08:51:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:50:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:50:49] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:51:04] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:51:11] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:51:44] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol1010-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:51:52] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol1009-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:51:59] <wikibugs>	 (03CR) 10Klausman: "This change is ready for review." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 (owner: 10Klausman)
[14:52:16] <wikibugs>	 (03PS2) 10Klausman: images: Add Go 1.21 image, based on bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228
[14:53:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr)
[14:53:42] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED
[14:54:31] <wikibugs>	 (03PS1) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[14:55:19] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev']
[14:55:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:55:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1009-dev']
[14:55:40] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1009-dev']
[14:55:41] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: add autoscaling for langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/967230 (https://phabricator.wikimedia.org/T340507)
[14:55:42] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev']
[14:55:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1009-dev']
[14:55:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1009-dev']
[14:56:17] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:56:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:56:23] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:56:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev']
[14:56:35] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev']
[14:56:39] <icinga-wm>	 PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:57:24] <elukey>	 wow again?
[14:57:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[14:58:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1009-dev.eqiad.wmnet']
[14:58:24] <elukey>	 !log powercycle titan1001
[14:58:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:38] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:27] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[14:59:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet1008-dev.eqiad.wmnet']
[14:59:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet1007-dev.eqiad.wmnet']
[15:00:29] <icinga-wm>	 PROBLEM - Host titan1001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:00:35] <sukhe>	 hmm
[15:00:39] <sukhe>	 oh elukey ok 
[15:01:40] <elukey>	 sukhe: yes yes it is me sorry :(
[15:01:55] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 7.105 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:01:56] <sukhe>	 np all good! was checking because I am on-call :)
[15:01:57] <icinga-wm>	 RECOVERY - Host titan1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[15:02:02] <wikibugs>	 (03CR) 10Ebernhardson: rdf-streaming-updater: update staging values (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking)
[15:02:05] <icinga-wm>	 RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Fri 03 Nov 2023 08:51:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:03:05] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:03:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:03:36] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[15:03:38] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:03:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Fabfur)
[15:04:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[15:04:45] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[15:05:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:05:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[15:06:06] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[15:07:34] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[15:07:39] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:50] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[15:07:50] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[15:07:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol1009-dev.eqiad.wmnet']
[15:08:11] <sukhe>	 elukey: is that you? sorry :) 
[15:08:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[15:08:22] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[15:08:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[15:08:29] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1010-dev.eqiad.wmnet']
[15:08:32] <denisse>	 !incidents
[15:08:32] <sirenbot>	 4136 (ACKED)  kafka-jumbo1001/Kafka Broker Server (paged)
[15:08:34] <elukey>	 sukhe: nono it is brouberol 
[15:08:35] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudnet1007-dev.eqiad.wmnet']
[15:08:37] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[15:08:37] <sukhe>	 denisse: ACKed
[15:08:38] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudnet1008-dev.eqiad.wmnet']
[15:08:39] <elukey>	 they are decomming old servers
[15:08:45] <sukhe>	 thanks elukey and sorry :P 
[15:08:47] <elukey>	 btullis, brouberol --^
[15:09:00] <Emperor>	 here, sorry, was making tea
[15:09:17] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1002 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:09:19] <elukey>	 nothing is exploding, only a false alert due to missing downtime
[15:09:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet1008-dev']
[15:09:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet1007-dev']
[15:09:28] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[15:09:28] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudnet1007-dev']
[15:09:29] <Emperor>	 ah, thanks, I'll go back to my tea
[15:09:29] <denisse>	 sukhe: it's related to the decommission, right??
[15:09:29] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudnet1008-dev']
[15:09:33] <sukhe>	 denisse: yep
[15:09:46] <Emperor>	 err, I just got p.aged again
[15:09:50] <sukhe>	 yeah
[15:09:51] <Emperor>	 !incidents
[15:09:51] <sirenbot>	 4136 (ACKED)  kafka-jumbo1001/Kafka Broker Server (paged)
[15:09:52] <sirenbot>	 4137 (ACKED)  kafka-jumbo1002/Kafka Broker Server (paged)
[15:10:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr)
[15:10:39] <brouberol>	 re kafka pages: sorry, my fault. I missed setting a silence
[15:10:48] <brouberol>	 it's all good, we stopped the services on purpose
[15:11:13] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967232 (https://phabricator.wikimedia.org/T347075)
[15:11:13] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:17] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[15:11:36] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-jumbo1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[15:11:51] <brouberol>	 I'll silence the hosts in icinga
[15:12:01] <sukhe>	 thanks brouberol!
[15:12:06] <sukhe>	 we got one more, so ACKing
[15:12:14] <brouberol>	 sorry again, my fault
[15:12:17] <wikibugs>	 (03CR) 10DCausse: rdf-streaming-updater: update staging values (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking)
[15:12:18] <Emperor>	 thanks. Are we going to get more?
[15:12:25] * Emperor still half-way through making this tea...
[15:12:35] <Emperor>	 I guess running up the stairs every time is good exercise
[15:13:02] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1001.eqiad.wmnet with reason: host is being decommissioned
[15:13:15] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1001.eqiad.wmnet with reason: host is being decommissioned
[15:13:18] <gehel>	 Emperor: we should be good now (everything silenced)
[15:13:25] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1002.eqiad.wmnet with reason: host is being decommissioned
[15:13:49] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1002.eqiad.wmnet with reason: host is being decommissioned
[15:13:55] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[15:13:55] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1003.eqiad.wmnet with reason: host is being decommissioned
[15:14:19] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:19] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1003.eqiad.wmnet with reason: host is being decommissioned
[15:14:22] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[15:14:26] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1004.eqiad.wmnet with reason: host is being decommissioned
[15:14:39] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1004.eqiad.wmnet with reason: host is being decommissioned
[15:14:45] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1005.eqiad.wmnet with reason: host is being decommissioned
[15:14:54] <sukhe>	 one more, ACKed
[15:14:58] <Emperor>	 !incidents
[15:14:59] <sirenbot>	 4136 (ACKED)  kafka-jumbo1001/Kafka Broker Server (paged)
[15:14:59] <sirenbot>	 4137 (ACKED)  kafka-jumbo1002/Kafka Broker Server (paged)
[15:14:59] <sirenbot>	 4138 (ACKED)  kafka-jumbo1003/Kafka Broker Server (paged)
[15:14:59] <sirenbot>	 4139 (RESOLVED)  kafka-jumbo1004/Kafka Broker Server (paged)
[15:15:10] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1005.eqiad.wmnet with reason: host is being decommissioned
[15:15:16] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on kafka-jumbo1006.eqiad.wmnet with reason: host is being decommissioned
[15:15:35] <brouberol>	 alright, I've silenced all 6 hosts in icinga
[15:15:41] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on kafka-jumbo1006.eqiad.wmnet with reason: host is being decommissioned
[15:16:02] <brouberol>	 sorry again folks
[15:17:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[15:17:24] <sukhe>	 brouberol: all good!
[15:22:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) This is now fixed in esams, solution that's been applied is to add a community on sessions to LVS servers if the MED is...
[15:24:10] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967232 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[15:24:57] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967232 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[15:25:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED
[15:26:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10ssingh) Thanks, confirming this is working: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=33
[15:28:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) Actually there is a caveat, traffic from other servers on asw1-bw27-esams will still route out via lvs3010, until I impl...
[15:30:06] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[15:30:19] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:31:59] <wikibugs>	 (03PS1) 10Fabfur: hiera: enable dual disk storage for new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/967235
[15:32:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hiera: enable dual disk storage for new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/967235 (owner: 10Fabfur)
[15:33:51] <wikibugs>	 (03PS2) 10Fabfur: hiera: enable dual disk storage for new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/967235 (https://phabricator.wikimedia.org/T349244)
[15:34:34] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[15:34:57] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:37:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[15:38:37] <wikibugs>	 (03PS1) 10Volans: sre.puppet.sync-netbox-hiera: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967236 (https://phabricator.wikimedia.org/T341973)
[15:40:39] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[15:41:09] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [restbase/deploy@a311c5d]: (no justification provided)
[15:42:03] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@a311c5d]: (no justification provided) (duration: 00m 54s)
[15:43:11] <wikibugs>	 (03CR) 10Ssingh: hiera: enable dual disk storage for new cp hosts in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967235 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur)
[15:46:01] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:39] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney)
[15:47:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/967236 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[15:48:12] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.puppet.sync-netbox-hiera: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967236 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[15:49:18] <wikibugs>	 (03PS1) 10Ssingh: hiera: add host override for cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/967239 (https://phabricator.wikimedia.org/T349244)
[15:49:23] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:40] <wikibugs>	 (03CR) 10Herron: profile::mediawiki::common: set default histogram buckets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron)
[15:50:31] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/88/cons" [puppet] - 10https://gerrit.wikimedia.org/r/967239 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh)
[15:52:20] <wikibugs>	 (03CR) 10Ssingh: hiera: enable dual disk storage for new cp hosts in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967235 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur)
[15:52:28] <wikibugs>	 (03PS1) 10Brouberol: Drop kafka-jumbo100[1-6].eqiad.wmnet from the puppet site [puppet] - 10https://gerrit.wikimedia.org/r/967240 (https://phabricator.wikimedia.org/T336044)
[15:52:38] <wikibugs>	 (03Merged) 10jenkins-bot: sre.puppet.sync-netbox-hiera: fine tune lock [cookbooks] - 10https://gerrit.wikimedia.org/r/967236 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[15:53:00] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 04-2] "Test commit for running PCC: do not merge." [puppet] - 10https://gerrit.wikimedia.org/r/967239 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh)
[15:56:11] <wikibugs>	 (03CR) 10Herron: [C: 03+2] pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[15:56:43] <wikibugs>	 (03PS8) 10Herron: pyrra: add logstash requests slo [puppet] - 10https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995)
[15:57:18] <wikibugs>	 (03CR) 10Cwhite: [V: 03+1 C: 03+1] profile::mediawiki::common: set default histogram buckets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron)
[15:58:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[16:00:05] <jouncebot>	 jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:20] <wikibugs>	 (03CR) 10Herron: [C: 03+2] pyrra: add logstash requests slo [puppet] - 10https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[16:06:12] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "This does not remove the hosts from puppetdb and hence the cumin aliases" [puppet] - 10https://gerrit.wikimedia.org/r/967240 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol)
[16:06:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:07:55] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Enable the cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/967243 (https://phabricator.wikimedia.org/T346861)
[16:08:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[16:08:31] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-2] "Not ready yet" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967243 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[16:09:16] <wikibugs>	 (03PS10) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[16:11:09] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Set an initialImport cron job [deployment-charts] - 10https://gerrit.wikimedia.org/r/967245 (https://phabricator.wikimedia.org/T346861)
[16:12:40] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Bump evaluators to noisy-logged ones [deployment-charts] - 10https://gerrit.wikimedia.org/r/967246 (https://phabricator.wikimedia.org/T343829)
[16:12:57] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] wikifunctions: Bump evaluators to noisy-logged ones [deployment-charts] - 10https://gerrit.wikimedia.org/r/967246 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester)
[16:13:46] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Bump evaluators to noisy-logged ones [deployment-charts] - 10https://gerrit.wikimedia.org/r/967246 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester)
[16:14:25] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:15:07] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:15:41] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:16:32] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:16:37] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:17:05] <wikibugs>	 (PS2) Kosta Harlan: [WIP] ipoid: Set an initialImport cron job [deployment-charts] - https://gerrit.wikimedia.org/r/967245 (https://phabricator.wikimedia.org/T346861)
[16:17:27] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:18:04] <wikibugs>	 (PS1) BryanDavis: striker: Bump container version to 2023-10-19-160227-production [puppet] - https://gerrit.wikimedia.org/r/967247 (https://phabricator.wikimedia.org/T348131)
[16:18:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[16:21:41] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 401 bytes in 5.513 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:24:21] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:24:51] <wikibugs>	 (PS1) Jforrester: wikifunctions: Change the staging JS evaluator over to WASM [deployment-charts] - https://gerrit.wikimedia.org/r/967249
[16:26:08] <wikibugs>	 (CR) BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/output/967247/90/" [puppet] - https://gerrit.wikimedia.org/r/967247 (https://phabricator.wikimedia.org/T348131) (owner: BryanDavis)
[16:30:50] <wikibugs>	 (PS2) Jforrester: wikifunctions: Change the staging JS evaluator over to WASM [deployment-charts] - https://gerrit.wikimedia.org/r/967249
[16:34:31] <wikibugs>	 (PS3) Jforrester: wikifunctions: Change the staging JS evaluator over to WASM instead of a special service [deployment-charts] - https://gerrit.wikimedia.org/r/967249
[16:44:39] <wikibugs>	 (PS3) Fabfur: haproxy: remove multiple backends choice [puppet] - https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287)
[16:44:41] <wikibugs>	 (PS11) Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[16:47:41] <wikibugs>	 (CR) Jforrester: [C: +2] wikifunctions: Change the staging JS evaluator over to WASM instead of a special service [deployment-charts] - https://gerrit.wikimedia.org/r/967249 (owner: Jforrester)
[16:48:33] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Change the staging JS evaluator over to WASM instead of a special service [deployment-charts] - https://gerrit.wikimedia.org/r/967249 (owner: Jforrester)
[16:49:54] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (jbond) Adding a note that the bad files are `/etc/cassandra-a/tls/server.trust` from C:cassandra  #line 443  ` lang=puppe...
[16:49:59] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:50:52] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:51:02] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Yann) >>! In T328872#9263825, @tstarling wrote: > Since the cause...
[16:51:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[16:51:10] <wikibugs>	 (PS4) Fabfur: haproxy: remove multiple backends choice [puppet] - https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287)
[16:51:12] <wikibugs>	 (PS12) Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[16:51:48] <wikibugs>	 (PS1) Genoveva Galarza: [WIP][wikifunctions] Alter site to General Availability [mediawiki-config] - https://gerrit.wikimedia.org/r/967254
[16:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:59:03] <wikibugs>	 (PS5) Fabfur: haproxy: remove multiple backends choice [puppet] - https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287)
[16:59:05] <wikibugs>	 (PS13) Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1700)
[17:02:24] <wikibugs>	 (CR) Fabfur: haproxy: remove multiple backends choice (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: Fabfur)
[17:03:08] <wikibugs>	 (CR) Bking: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[17:12:14] <wikibugs>	 (PS8) Bking: rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[17:15:57] <icinga-wm>	 PROBLEM - BGP status on ssw1-a8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Active - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:16:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[17:16:45] <wikibugs>	 (PS1) Dwisehaupt: Initial checking of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[17:17:11] <wikibugs>	 (CR) CI reject: [V: -1] Initial checking of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[17:18:16] <wikibugs>	 (PS2) Dwisehaupt: Initial checking of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[17:18:44] <wikibugs>	 (CR) CI reject: [V: -1] Initial checking of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[17:20:04] <wikibugs>	 (PS3) Dwisehaupt: Initial checking of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[17:20:05] <icinga-wm>	 RECOVERY - BGP status on ssw1-a8-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:20:30] <wikibugs>	 (CR) CI reject: [V: -1] Initial checking of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[17:21:51] <wikibugs>	 (PS1) Mabualruz: Make Vector feature flags compatible with beta features [mediawiki-config] - https://gerrit.wikimedia.org/r/967259 (https://phabricator.wikimedia.org/T347772)
[17:27:39] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:28:07] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:28:11] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:28:25] <urandom>	 ^^^ ignore, I should have had those downtimed
[17:28:58] <sukhe>	 ok!
[17:33:50] <urandom>	 !log Decommissioning Cassandra, restbase1018-{a,b,c} — T328490
[17:33:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:54] <stashbot>	 T328490: restbase cluster: decommission end-of-life hosts - https://phabricator.wikimedia.org/T328490
[17:35:00] <wikibugs>	 (PS4) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[17:35:27] <wikibugs>	 (CR) CI reject: [V: -1] Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[17:39:06] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/96/cons" [puppet] - https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: Fabfur)
[17:42:26] <wikibugs>	 (PS5) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[17:42:53] <wikibugs>	 (CR) CI reject: [V: -1] Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[17:45:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[17:55:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[18:00:05] <jouncebot>	 brennen and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1800).
[18:00:32] <brennen>	 o/
[18:01:58] <wikibugs>	 (CR) AOkoth: [C: +2] vrts: add new required packages v6.3.4 [puppet] - https://gerrit.wikimedia.org/r/966279 (https://phabricator.wikimedia.org/T348987) (owner: AOkoth)
[18:02:10] <wikibugs>	 (PS1) TrainBranchBot: group2 wikis to 1.42.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/967262 (https://phabricator.wikimedia.org/T348354)
[18:02:12] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group2 wikis to 1.42.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/967262 (https://phabricator.wikimedia.org/T348354) (owner: TrainBranchBot)
[18:02:54] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.42.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/967262 (https://phabricator.wikimedia.org/T348354) (owner: TrainBranchBot)
[18:08:48] <wikibugs>	 (CR) Jdlrobson: [C: +1] Make Vector feature flags compatible with beta features [mediawiki-config] - https://gerrit.wikimedia.org/r/967259 (https://phabricator.wikimedia.org/T347772) (owner: Mabualruz)
[18:09:16] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.1  refs T348354
[18:09:30] <stashbot>	 T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354
[18:09:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[18:10:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[18:14:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[18:15:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[18:17:55] <wikibugs>	 (PS1) Jforrester: wikifunctions: Bump staging image for better logging [deployment-charts] - https://gerrit.wikimedia.org/r/967263
[18:20:05] <brennen>	 ^ hrm.  dead letters did spike around deploy time.  unclear to me what that means, though.
[18:20:26] <wikibugs>	 (PS6) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[18:20:34] <wikibugs>	 (CR) Jforrester: [C: +2] wikifunctions: Bump staging image for better logging [deployment-charts] - https://gerrit.wikimedia.org/r/967263 (owner: Jforrester)
[18:21:24] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Bump staging image for better logging [deployment-charts] - https://gerrit.wikimedia.org/r/967263 (owner: Jforrester)
[18:22:10] <jan_drewniak>	 Hello, is there a deploy happening right now? I was wondering if it's ok to backport a beta-cluster config change outside the backport window? (patch in question: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/967259/)
[18:22:13] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[18:22:54] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[18:22:55] <wikibugs>	 (CR) CI reject: [V: -1] Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[18:23:23] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:26:07] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:27:29] <RhinosF1>	 jouncebot: now
[18:27:29] <jouncebot>	 For the next 1 hour(s) and 32 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T1800)
[18:28:21] <RhinosF1>	 brennen: can you help jan_drewniak with deployment?
[18:28:52] <RhinosF1>	 Beta only but obviously not stepping on your train
[18:33:41] <dancy>	 jan_drewniak: Train promotion already happened so you should be good to go 
[18:34:38] <jan_drewniak>	 dancy: ok thanks, I'll go ahead with the beta-config change now then.
[18:35:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/967259 (https://phabricator.wikimedia.org/T347772) (owner: Mabualruz)
[18:35:45] <wikibugs>	 (Merged) jenkins-bot: Make Vector feature flags compatible with beta features [mediawiki-config] - https://gerrit.wikimedia.org/r/967259 (https://phabricator.wikimedia.org/T347772) (owner: Mabualruz)
[18:36:34] <jan_drewniak>	 ok so beta-config changes *are* automatically synced... I wasn't sure about that
[18:38:34] <RhinosF1>	 jan_drewniak: you still need to scap pull
[18:38:57] <dancy>	 `scap backport` handles all of the details.
[18:39:01] <RhinosF1>	 Oh you did scap backport
[18:39:06] <jan_drewniak>	 RhinosF1: right, I ran `scap backport 967259` which did that pull :) 
[18:39:07] <RhinosF1>	 Ye that's fancy enough to be smart
[18:39:48] <RhinosF1>	 So yes all magic happens, beta will deploy itself
[18:39:55] <jan_drewniak>	 yeah  I love it, after the pull it told me: `18:35:57 Skipping sync since all commits were beta/labs-only changes. Operation completed.` 
[18:40:10] <jan_drewniak>	 thanks!
[18:40:18] <RhinosF1>	 Scap backport is an amazing thing
[18:51:30] <brennen>	 jan_drewniak, RhinosF1: sorry i missed the ping here earlier - been digging around in logstash.
[18:51:59] <RhinosF1>	 brennen: no problem, dancy confirmed floor was clear
[19:01:52] <wikibugs>	 (PS1) Herron: pyrra-filesystem: enable generic recording rules [puppet] - https://gerrit.wikimedia.org/r/967273 (https://phabricator.wikimedia.org/T302995)
[19:03:35] <wikibugs>	 (PS1) HMonroy: PhonosButton: use text() instead of append() [extensions/Phonos] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/967188 (https://phabricator.wikimedia.org/T349312)
[19:03:38] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:03:51] <wikibugs>	 (CR) Herron: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/97/cons" [puppet] - https://gerrit.wikimedia.org/r/967273 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[19:09:30] <wikibugs>	 SRE, Maps: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (Platonides) What Referer would be provided by such app? Would the requests from the app have a User-Agent identifying it? Which one?
[19:10:31] <wikibugs>	 (CR) Herron: [V: +1 C: +2] pyrra-filesystem: enable generic recording rules [puppet] - https://gerrit.wikimedia.org/r/967273 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[19:11:29] <wikibugs>	 (PS3) Sohom Datta: InitialiseSettings-labs: Set values for renamed PageTriage variable [mediawiki-config] - https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595) (owner: MPGuy2824)
[19:15:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:16:55] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:39] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/967280
[19:23:01] <wikibugs>	 (CR) Herron: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/98/cons" [puppet] - https://gerrit.wikimedia.org/r/967280 (owner: Herron)
[19:25:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:27:03] <wikibugs>	 (PS1) Ryan Kemper: elastic: make request latency alerts more granular [puppet] - https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340)
[19:28:25] <wikibugs>	 (PS2) Ryan Kemper: elastic: make request latency alerts more granular [puppet] - https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340)
[19:30:41] <wikibugs>	 (PS7) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[19:31:03] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/99/console" [puppet] - https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) (owner: Ryan Kemper)
[19:31:44] <wikibugs>	 (Abandoned) Bartosz Dziewoński: DNM: null edit CI test [extensions/DiscussionTools] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/966242 (owner: C. Scott Ananian)
[19:32:07] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:33:07] <wikibugs>	 (CR) CI reject: [V: -1] Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[19:36:36] <wikibugs>	 (PS2) Herron: pyrra::filesystem: remove [] from slo definition [puppet] - https://gerrit.wikimedia.org/r/967280
[19:37:15] <wikibugs>	 (PS3) Ryan Kemper: elastic: make request latency alerts more granular [puppet] - https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340)
[19:38:24] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/100/console" [puppet] - https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) (owner: Ryan Kemper)
[19:39:31] <wikibugs>	 (CR) Ebernhardson: [C: +1] elastic: make request latency alerts more granular [puppet] - https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) (owner: Ryan Kemper)
[19:39:44] <wikibugs>	 (CR) Ryan Kemper: [V: +1 C: +2] elastic: make request latency alerts more granular [puppet] - https://gerrit.wikimedia.org/r/967281 (https://phabricator.wikimedia.org/T349340) (owner: Ryan Kemper)
[19:40:06] <wikibugs>	 (CR) Herron: [C: +2] pyrra::filesystem: remove [] from slo definition [puppet] - https://gerrit.wikimedia.org/r/967280 (owner: Herron)
[19:40:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:45:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:47:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:50:25] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:56:34] <wikibugs>	 (PS1) Herron: thanos: update reload endpoint to reflect updated web prefix [puppet] - https://gerrit.wikimedia.org/r/967291 (https://phabricator.wikimedia.org/T349102)
[19:57:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:58:14] <wikibugs>	 (CR) Herron: "I noticed reload started throwing a 404" [puppet] - https://gerrit.wikimedia.org/r/967291 (https://phabricator.wikimedia.org/T349102) (owner: Herron)
[20:00:05] <jouncebot>	 brennen and TheresNoTime: Dear deployers, time to do the UTC late backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231019T2000).
[20:02:02] <brennen>	 o/
[20:02:08] <brennen>	 !log utc late backport window: no patches
[20:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:14] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:08:02] <wikibugs>	 (PS1) BCornwall: Add Prometheus metrics for fifo-log-demux [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/967293
[20:11:59] <sbassett>	 Hey brennen and TheresNoTime - was wondering if I could get a security patch out now (saw there was a backport training session today…)
[20:12:18] <James_F>	 sbassett: No patches scheduled so should be good to go.
[20:13:15] <sbassett>	 tx, James_F
[20:15:27] <thcipriani>	 ^ +1 should be fine, brennen and I staring at logs :)
[20:16:38] <sbassett>	 Ok, I might not have a deploy after all as I need a question answered for T336027.  Sorry about that :)
[20:16:56] <wikibugs>	 (PS2) BCornwall: Add Prometheus metrics for fifo-log-demux [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/967293 (https://phabricator.wikimedia.org/T345939)
[20:17:23] <wikibugs>	 (PS1) Bartosz Dziewoński: Don't remove current wiki family from $wgCentralAuthAutoLoginWikis [mediawiki-config] - https://gerrit.wikimedia.org/r/967295
[20:19:43] <jinxer-wm>	 (SystemdUnitFailed) firing: puppet-agent-timer.service Failed on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:20:28] <icinga-wm>	 PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:30:00] <icinga-wm>	 PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:31:22] <icinga-wm>	 RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:39:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye
[20:39:42] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye
[20:39:52] <wikibugs>	 (PS1) Bartosz Dziewoński: Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook [mediawiki-config] - https://gerrit.wikimedia.org/r/967302
[20:39:54] <wikibugs>	 (PS1) Bartosz Dziewoński: Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - https://gerrit.wikimedia.org/r/967303
[20:40:38] <icinga-wm>	 RECOVERY - Check systemd state on apifeatureusage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:40:54] <wikibugs>	 (CR) Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza)
[20:41:07] <wikibugs>	 (CR) CI reject: [V: -1] Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - https://gerrit.wikimedia.org/r/967303 (owner: Bartosz Dziewoński)
[20:42:18] <wikibugs>	 (CR) Bartosz Dziewoński: "I'm trying to split off some bits of this into separate changes that I wouldn't be very afraid to get deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza)
[20:43:57] <wikibugs>	 SRE, AQS2.0, Cassandra, serviceops, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (Htriedman) Hi @VirginiaPoundstone! Thanks for the detailed questions! I'll try to answer them one by one :)  > 1. Who is the audien...
[20:44:27] <wikibugs>	 (PS2) Bartosz Dziewoński: Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - https://gerrit.wikimedia.org/r/967303
[20:44:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: puppet-agent-timer.service Failed on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:47:20] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (aaron) I wonder if the auth token just expired while the combined...
[20:48:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:49:14] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Uploading, Unstewarded-production-error, Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (aaron) >>! In T341007#9251361, @Beao wrote: > Am I experiencing the same pro...
[20:52:23] <wikibugs>	 (PS9) Bking: rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[20:53:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:55:29] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye
[21:01:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:06:47] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:10:42] <wikibugs>	 (PS10) Bking: rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[21:11:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:12:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye
[21:12:55] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye
[21:22:32] <wikibugs>	 (CR) HMonroy: [C: +2] PhonosButton: use text() instead of append() [extensions/Phonos] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/967188 (https://phabricator.wikimedia.org/T349312) (owner: HMonroy)
[21:23:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by hmonroy@deploy2002 using scap backport" [extensions/Phonos] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/967188 (https://phabricator.wikimedia.org/T349312) (owner: HMonroy)
[21:24:19] <wikibugs>	 (PS11) Bking: rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[21:25:03] <wikibugs>	 (CR) CI reject: [V: -1] rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[21:25:25] <wikibugs>	 (Merged) jenkins-bot: PhonosButton: use text() instead of append() [extensions/Phonos] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/967188 (https://phabricator.wikimedia.org/T349312) (owner: HMonroy)
[21:25:44] <logmsgbot>	 !log hmonroy@deploy2002 Started scap: Backport for [[gerrit:967188|PhonosButton: use text() instead of append() (T349312)]]
[21:26:43] <wikibugs>	 (PS12) Bking: rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[21:27:00] <logmsgbot>	 !log hmonroy@deploy2002 hmonroy: Backport for [[gerrit:967188|PhonosButton: use text() instead of append() (T349312)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:27:15] <logmsgbot>	 !log hmonroy@deploy2002 hmonroy: Continuing with sync
[21:27:27] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:27:28] <wikibugs>	 (CR) CI reject: [V: -1] rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[21:28:31] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye
[21:29:05] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2045 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:31:51] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2044 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:32:33] <logmsgbot>	 !log hmonroy@deploy2002 Finished scap: Backport for [[gerrit:967188|PhonosButton: use text() instead of append() (T349312)]] (duration: 06m 48s)
[21:41:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:42:35] <wikibugs>	 (CR) JHathaway: "Is there any way to review what code changed? Why is the old code not being deleted? I don't really have any context on the function, that" [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond)
[21:44:36] <wikibugs>	 (CR) JHathaway: compile_redirects: port compile_redirects to new API (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond)
[21:49:55] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:50:23] <icinga-wm>	 PROBLEM - BFD status on cr1-esams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:51:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:53:13] <wikibugs>	 (PS13) Bking: rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[21:53:58] <wikibugs>	 (CR) CI reject: [V: -1] rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[22:04:39] <wikibugs>	 (PS14) Bking: rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[22:08:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[22:24:53] <icinga-wm>	 RECOVERY - BFD status on cr1-esams is OK: UP: 5 AdminDown: 3 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:25:49] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:28:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[22:28:19] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:28:25] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:36:46] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED
[22:37:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED
[22:38:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[22:48:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[23:00:17] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:00:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:03:38] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:51:11] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[23:56:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[23:58:14] <wikibugs>	 (CR) Krinkle: [C: +1] "LGTM. Feel free to deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/966945 (https://phabricator.wikimedia.org/T47514) (owner: Tim Starling)