[00:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:43:16] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:26] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:41:45] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:45] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:20] PROBLEM - IPMI Sensor Status on clouddumps1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Inlet Temp = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[02:11:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:30] RECOVERY - IPMI Sensor Status on clouddumps1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[02:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[04:17:06] PROBLEM - IPMI Sensor Status on clouddumps1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Inlet Temp = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[04:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:48:24] RECOVERY - IPMI Sensor Status on clouddumps1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[05:02:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:07:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:20:26] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:16] (PS20) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[05:52:01] (PS21) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[05:54:01] (CR) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (4 comments) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[05:54:07] (PS5) Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495
[05:54:09] (PS1) Giuseppe Lavagetto: shellbox: conversion to modules [deployment-charts] - https://gerrit.wikimedia.org/r/840789
[05:54:21] (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[05:55:07] (CR) CI reject: [V: -1] Stub of the new organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (owner: Giuseppe Lavagetto)
[05:55:09] (CR) CI reject: [V: -1] shellbox: conversion to modules [deployment-charts] - https://gerrit.wikimedia.org/r/840789 (owner: Giuseppe Lavagetto)
[05:56:59] (PS1) Majavah: P:wmcs::instance: stop provisioning /etc/wmflabs-* on bookworm [puppet] - https://gerrit.wikimedia.org/r/840791
[06:06:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 397715
[06:06:17] 
!log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 397715
[06:12:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:15:56] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:24:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:34:40] (PS1) Elukey: ml-services: update Docker images after shared code update [deployment-charts] - https://gerrit.wikimedia.org/r/840792
[06:43:10] (PS2) Elukey: ml-services: update Docker images after shared code update [deployment-charts] - https://gerrit.wikimedia.org/r/840792
[06:48:09] (PS2) Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277)
[06:54:45] (CR) CI reject: [V: -1] icinga: allow wait_for_optimal to ignore ack'ed alerts. 
[software/spicerack] - https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: Slyngshede)
[06:55:06] (PS1) Muehlenhoff: Remove access for bmansurov [puppet] - https://gerrit.wikimedia.org/r/840794
[06:55:45] (CR) CI reject: [V: -1] Remove access for bmansurov [puppet] - https://gerrit.wikimedia.org/r/840794 (owner: Muehlenhoff)
[06:56:38] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bmansurov out of all services on: 797 hosts
[06:56:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bmansurov out of all services on: 797 hosts
[06:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[06:58:11] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bmansurov out of all services on: 1211 hosts
[06:58:12] (PS2) Muehlenhoff: Remove access for bmansurov [puppet] - https://gerrit.wikimedia.org/r/840794
[06:58:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bmansurov out of all services on: 1211 hosts
[06:59:03] (CR) CI reject: [V: -1] Remove access for bmansurov [puppet] - https://gerrit.wikimedia.org/r/840794 (owner: Muehlenhoff)
[07:00:04] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221010T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS. 
[07:02:52] (PS3) Muehlenhoff: Remove access for bmansurov [puppet] - https://gerrit.wikimedia.org/r/840794
[07:05:28] (CR) Muehlenhoff: [C: +2] Remove access for bmansurov [puppet] - https://gerrit.wikimedia.org/r/840794 (owner: Muehlenhoff)
[07:12:02] (PS3) Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277)
[07:15:42] (PS2) Muehlenhoff: Make ganeti1030 a ganeti node [puppet] - https://gerrit.wikimedia.org/r/839521 (https://phabricator.wikimedia.org/T299459)
[07:19:01] (CR) Muehlenhoff: [C: +2] Make ganeti1030 a ganeti node [puppet] - https://gerrit.wikimedia.org/r/839521 (https://phabricator.wikimedia.org/T299459) (owner: Muehlenhoff)
[07:19:20] (CR) CI reject: [V: -1] icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: Slyngshede)
[07:22:10] (PS4) Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277)
[07:23:17] (CR) Elukey: [C: +2] ml-services: update Docker images after shared code update [deployment-charts] - https://gerrit.wikimedia.org/r/840792 (owner: Elukey)
[07:26:49] !log kill hanging process for user bmansurov on deploy1002 to allow proper user cleanup
[07:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:07] (PS1) Muehlenhoff: Fix selector in site.pp [puppet] - https://gerrit.wikimedia.org/r/840834
[07:31:23] (CR) Muehlenhoff: [C: +2] Fix selector in site.pp [puppet] - https://gerrit.wikimedia.org/r/840834 (owner: Muehlenhoff)
[07:31:50] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[07:32:40] (CR) Filippo Giunchedi: [C: +1] "Nice!" [alerts] - https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[07:34:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:35:49] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[07:37:49] (CR) Hashar: "I think this change broke Puppet on the Gitlab runners:" [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[07:37:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[07:37:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:39:35] (CR) Filippo Giunchedi: [C: +1] logstash: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:39:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[07:41:49] (CR) JMeybohm: [C: +2] Update to Kubernetes v1.23.12 [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[07:43:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet
[07:43:21] !log bounce thanos-compact on thanos-fe2001
[07:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:47] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[07:45:26] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[07:45:54] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[07:51:02] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:58] !log imported kubernetes 1.23.12 to component/kubernetes123 for buster-wikimedia, bullseye-wikimedia - T307943
[07:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:03] T307943: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943
[07:52:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet
[07:52:59] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:53:39] SRE, Data Engineering Planning, Data Pipelines, Foundational Technology Requests, User-fgiunchedi: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (fgiunchedi) >>! 
In T314981#8293341, @Ottomata wrote: > Cool! Is `this.ip.geoip_asn` bui...
[07:54:32] (PS1) Hashar: gitlab: proxy settings on runners must be optional [puppet] - https://gerrit.wikimedia.org/r/840835 (https://phabricator.wikimedia.org/T317997)
[07:55:40] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:45] (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:56:57] (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[07:57:59] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:58:01] (CR) Hashar: "I have cherry picked it on gitlab-runners-puppetmaster-01.gitlab-runners.eqiad1.wikimedia.cloud and that fixed up the compilation when usi" [puppet] - https://gerrit.wikimedia.org/r/840835 (https://phabricator.wikimedia.org/T317997) (owner: Hashar)
[08:01:03] (PS1) JMeybohm: Bump threshold for LIST secrets from 1.7s to 2s [alerts] - https://gerrit.wikimedia.org/r/840837 (https://phabricator.wikimedia.org/T311251)
[08:01:11] (CR) Volans: "Nice work! I've just left two suggestions and few minor nits, but the code would work also as-is." 
[software/spicerack] - https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: Slyngshede)
[08:04:18] (CR) JMeybohm: [C: +2] Bump threshold for LIST secrets from 1.7s to 2s [alerts] - https://gerrit.wikimedia.org/r/840837 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[08:04:31] (CR) Ayounsi: Add section for PIC config of QFX5120-48Y port block speeds (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[08:05:31] SRE, Observability-Alerting, User-fgiunchedi: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (fgiunchedi) max check alert latency did shoot up over last weekend (though to 7min) and self-recovered as far as I can tell: {F35560029}
[08:05:55] !log restarting db2100:s7 to apply new buffer pool config
[08:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:51] (Merged) jenkins-bot: Bump threshold for LIST secrets from 1.7s to 2s [alerts] - https://gerrit.wikimedia.org/r/840837 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[08:07:14] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37488/console" [puppet] - https://gerrit.wikimedia.org/r/840835 (https://phabricator.wikimedia.org/T317997) (owner: Hashar)
[08:08:14] (CR) Jelto: [V: +1 C: +1] "lgtm, thanks for fixing this!" 
[puppet] - https://gerrit.wikimedia.org/r/840835 (https://phabricator.wikimedia.org/T317997) (owner: Hashar)
[08:09:14] !log online resizefs of backup2003 bacula partition
[08:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1030.eqiad.wmnet to cluster eqiad and group A
[08:13:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1030.eqiad.wmnet to cluster eqiad and group A
[08:17:11] (CR) Clément Goubert: Release upstream version 3.9.4 (1 comment) [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (https://phabricator.wikimedia.org/T317511) (owner: Clément Goubert)
[08:18:19] (CR) Clément Goubert: [C: +2] Release upstream version 3.9.4 [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (https://phabricator.wikimedia.org/T317511) (owner: Clément Goubert)
[08:21:27] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:22:30] (Merged) jenkins-bot: Release upstream version 3.9.4 [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (https://phabricator.wikimedia.org/T317511) (owner: Clément Goubert)
[08:23:26] !log online resizefs of backup1003 bacula partition
[08:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:47] (CR) Muehlenhoff: [C: +2] Switch profile::base::linux510 to the new meta package [puppet] - https://gerrit.wikimedia.org/r/840125 (https://phabricator.wikimedia.org/T319067) (owner: Muehlenhoff)
[08:27:58] (PS3) Clément Goubert: Add build instructions in debian/README [debs/helm3] - https://gerrit.wikimedia.org/r/839550
[08:28:56] !log set thanos ring replicas to 3.68 T311690
[08:28:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:00] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[08:29:10] (CR) David Caro: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/840143 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:32:10] (PS1) Muehlenhoff: Revert "Switch profile::base::linux510 to the new meta package" [puppet] - https://gerrit.wikimedia.org/r/841031
[08:32:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:09] (CR) Clément Goubert: [C: +2] Add build instructions in debian/README [debs/helm3] - https://gerrit.wikimedia.org/r/839550 (owner: Clément Goubert)
[08:33:16] (CR) Jelto: [V: +1 C: +2] gitlab: proxy settings on runners must be optional [puppet] - https://gerrit.wikimedia.org/r/840835 (https://phabricator.wikimedia.org/T317997) (owner: Hashar)
[08:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[08:33:32] (CR) Hashar: [V: +2 C: +2] Allow SRE to send annotated and signed tags [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/836711 (owner: Hashar)
[08:34:15] (CR) Hashar: [V: +2 C: +2] "Clément, that would let SRE push any kind of tags ;)" [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/836711 (owner: Hashar)
[08:35:03] (CR) Muehlenhoff: [C: +2] Revert "Switch profile::base::linux510 to the new meta package" [puppet] - https://gerrit.wikimedia.org/r/841031 
(owner: Muehlenhoff)
[08:35:26] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. I'll leave andrew to merge it in case he needs to check anything else." [puppet] - https://gerrit.wikimedia.org/r/840791 (owner: Majavah)
[08:37:08] (CR) Arturo Borrero Gonzalez: [C: +2] P:terraform: add a new basic terraform module registry [puppet] - https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) (owner: Majavah)
[08:37:16] (Merged) jenkins-bot: Add build instructions in debian/README [debs/helm3] - https://gerrit.wikimedia.org/r/839550 (owner: Clément Goubert)
[08:37:18] (CR) JMeybohm: [C: +1] Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[08:37:59] (CR) JMeybohm: [V: +2 C: +2] Update cfssl-issuer image to v0.3.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/838131 (https://phabricator.wikimedia.org/T310486) (owner: JMeybohm)
[08:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:45:03] (PS1) Muehlenhoff: cloudgw: Stop including profile::base::linux510 [puppet] - https://gerrit.wikimedia.org/r/841032
[08:45:54] (CR) Btullis: [C: +2] Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[08:46:08] (PS3) Btullis: Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196)
[08:46:54] (PS1) Majavah: P:toolforge::prometheus: scrape jobs-api [puppet] - 
https://gerrit.wikimedia.org/r/841033 (https://phabricator.wikimedia.org/T320284)
[08:53:53] (PS1) Muehlenhoff: cloudnet: Stop including profile::base::linux510 [puppet] - https://gerrit.wikimedia.org/r/841034
[08:54:10] (PS3) Jcrespo: mariadb: Set binlog format for dbstore mariadb databases to ROW [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062)
[08:57:35] (PS4) Jcrespo: mariadb: Set binlog format for dbstore mariadb databases to ROW [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062)
[08:57:41] (CR) Volans: Adapt sre.switchdc.mediawiki to active-active mediawiki (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto)
[08:57:48] (PS4) Volans: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto)
[08:58:20] (CR) Volans: "I've sent a new PS addressing my comments, _joe_, jbond let me know what do you think." 
[cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto)
[08:58:35] (CR) Jcrespo: mariadb: Set binlog format for dbstore mariadb databases to ROW (1 comment) [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[08:59:11] (PS14) David Caro: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[08:59:38] (CR) Jcrespo: [C: +2] mariadb: Set binlog format for dbstore mariadb databases to ROW [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[09:00:58] (CR) Jcrespo: [C: +2] "CC Btullis as this theoretically impacts dbstores (while it was mainly targeted to backup sources), but it is a noop (as there is no binlo" [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[09:01:51] (PS1) Muehlenhoff: Make ganeti1031 a ganeti node [puppet] - https://gerrit.wikimedia.org/r/841035 (https://phabricator.wikimedia.org/T299459)
[09:02:01] (CR) CI reject: [V: -1] sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto)
[09:02:27] (CR) CI reject: [V: -1] Make ganeti1031 a ganeti node [puppet] - https://gerrit.wikimedia.org/r/841035 (https://phabricator.wikimedia.org/T299459) (owner: Muehlenhoff)
[09:03:00] (PS9) Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116)
[09:03:26] (CR) Btullis: [C: +1] "Yep, makes sense. Many thanks." 
[puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[09:04:45] SRE, Traffic, Patch-For-Review, Performance-Team (Radar), Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Vgutierrez) It looks like I've found a mitigation for this issue, tested in [[ https://...
[09:05:35] (CR) Arturo Borrero Gonzalez: [C: +1] cloudnet: Stop including profile::base::linux510 [puppet] - https://gerrit.wikimedia.org/r/841034 (owner: Muehlenhoff)
[09:05:59] (CR) Arturo Borrero Gonzalez: [C: +1] "thanks" [puppet] - https://gerrit.wikimedia.org/r/841032 (owner: Muehlenhoff)
[09:11:33] (CR) Muehlenhoff: [C: +2] cloudgw: Stop including profile::base::linux510 [puppet] - https://gerrit.wikimedia.org/r/841032 (owner: Muehlenhoff)
[09:13:42] (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge::prometheus: scrape jobs-api [puppet] - https://gerrit.wikimedia.org/r/841033 (https://phabricator.wikimedia.org/T320284) (owner: Majavah)
[09:13:45] (CR) Muehlenhoff: P:base: configure Linux 5.10 on buster via Hiera (4 comments) [puppet] - https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067) (owner: Ssingh)
[09:14:24] (PS1) Vgutierrez: trafficserver: Partition cache in one server per DC and cluster [puppet] - https://gerrit.wikimedia.org/r/841036 (https://phabricator.wikimedia.org/T317748)
[09:15:15] (PS2) Vgutierrez: trafficserver: Partition cache in one server per DC and cluster [puppet] - https://gerrit.wikimedia.org/r/841036 (https://phabricator.wikimedia.org/T317748)
[09:15:29] (CR) Muehlenhoff: [C: +2] cloudnet: Stop including profile::base::linux510 [puppet] - https://gerrit.wikimedia.org/r/841034 (owner: Muehlenhoff)
[09:17:15] (CR) Elukey: Add a spark-operator production image (1 comment) [docker-images/production-images] 
- https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:17:36] (PS2) Muehlenhoff: Make ganeti1031 a ganeti node [puppet] - https://gerrit.wikimedia.org/r/841035 (https://phabricator.wikimedia.org/T299459)
[09:18:10] (CR) JMeybohm: [C: -1] Add a spark-operator production image (3 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:21:49] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37490/console" [puppet] - https://gerrit.wikimedia.org/r/841036 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[09:23:46] (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Partition cache in one server per DC and cluster [puppet] - https://gerrit.wikimedia.org/r/841036 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[09:26:04] !log partitioning the ATS cache in cp1089, cp1090, cp2041, cp2042, cp3064, cp3065, cp4034, cp4036, cp5014, cp5016, cp6007, cp6015 - T317748
[09:26:07] (CR) Giuseppe Lavagetto: [C: +1] mediawiki: stop checking per-appserver availability [puppet] - https://gerrit.wikimedia.org/r/825742 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[09:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:09] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748
[09:26:49] (CR) JMeybohm: [C: -1] Add a new production image for spark version 3.3.0 (5 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:27:23] (CR) Giuseppe Lavagetto: [C: -1] "Why would this need to be a separated group build from the normal production 
images?" [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:29:25] (PS5) Volans: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto)
[09:30:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[09:30:31] (CR) Filippo Giunchedi: [C: +2] mediawiki: stop checking per-appserver availability [puppet] - https://gerrit.wikimedia.org/r/825742 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[09:30:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[09:30:37] (PS3) Filippo Giunchedi: mediawiki: stop checking per-appserver availability [puppet] - https://gerrit.wikimedia.org/r/825742 (https://phabricator.wikimedia.org/T314118)
[09:30:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T314041)', diff saved to https://phabricator.wikimedia.org/P35383 and previous config saved to /var/cache/conftool/dbconfig/20221010-093041-ladsgroup.json
[09:30:46] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:33:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:33:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:33:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T314041)', diff saved to https://phabricator.wikimedia.org/P35384 and previous config saved to /var/cache/conftool/dbconfig/20221010-093334-ladsgroup.json
[09:33:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet
[09:35:14] 
!log Imported helm3 3.9.4-1 to buster-wikimedia and bullseye-wikimedia [09:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [09:43:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1026.eqiad.wmnet to cluster eqiad and group A [09:44:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1026.eqiad.wmnet to cluster eqiad and group A [09:44:50] (CR) David Caro: Add SSH key for sstefanova to authorized keys (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) (owner: Slavina Stefanova) [09:52:44] (CR) Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: Muehlenhoff) [09:53:55] (CR) Hashar: POST events asynchronously (2 comments) [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816115 (owner: Hashar) [09:54:15] (PS3) Hashar: POST events asynchronously [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816115 [09:54:25] (CR) CI reject: [V: -1] POST events asynchronously [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816115 (owner: Hashar) [09:54:31] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [09:54:37] uh [09:54:58] siiigh that might be me vgutierrez [09:55:06] checking [09:55:12] thx [09:56:25] SRE, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (Arian_Bozorg) As the #WMF-Legal project tag was added to this task, some
general information to avoid wrong expectations: Please note that public tasks i... [09:58:17] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10RhinosF1) [09:58:36] (03PS10) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [10:00:05] Urbanecm and Amir1: gettimeofday() says it's time for New wikis creation. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221010T1000) [10:00:09] * urbanecm waves! [10:00:50] o/ [10:02:04] (03CR) 10CI reject: [V: 04-1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [10:02:57] * urbanecm forgot to upload patches, doing... [10:03:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] confd: export template status as Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: 10Filippo Giunchedi) [10:03:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] confd: install and run confd_prometheus_metrics [puppet] - 10https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272) (owner: 10Filippo Giunchedi) [10:04:19] (03CR) 10FNegri: [C: 04-1] "I think this patch can be abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/840107 (https://phabricator.wikimedia.org/T320232) (owner: 10Arturo Borrero Gonzalez) [10:04:40] !log rolling upgrade to HAProxy 2.4.19 on both text and upload caching clusters [10:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:47] (03PS1) 10Urbanecm: Initial configuration for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841086 (https://phabricator.wikimedia.org/T319183) [10:06:49] (03PS1) 10Urbanecm: Initial configuration for tlwikiquote 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/841087 (https://phabricator.wikimedia.org/T317107) [10:07:13] (03PS11) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [10:07:49] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841086 (https://phabricator.wikimedia.org/T319183) (owner: 10Urbanecm) [10:08:53] (03Merged) 10jenkins-bot: Initial configuration for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841086 (https://phabricator.wikimedia.org/T319183) (owner: 10Urbanecm) [10:09:08] pulling to mwmaint [10:09:42] and...addWiki errors [10:09:51] https://www.irccloud.com/pastebin/Nyr3ygdw/ [10:10:36] Amir1: ^^ [10:10:44] sigh [10:10:48] let me check [10:10:56] oh lovely [10:11:15] does it still try to close connections? [10:11:19] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10Lydia_Pintscher) [10:11:38] it has this, but i'm not sure if redefineLocalDomain closes connections or not https://www.irccloud.com/pastebin/2KnsC4TL/ [10:12:06] calls to closeConnection() were removed with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/815301, and that's in prod [10:13:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:13:24] I know how to fix this, the destruct is sorta broken. You need to change the line 159 variable to have another name, e.g. echoDB or something [10:14:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:14:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:14:24] ah, yeah. 
like last time :/ [10:14:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:15:02] Amir1: trying, but it says a root has the file opened :) [10:15:14] I wonder who that'd be [10:15:22] urbanecm: free now [10:15:23] sorry [10:15:27] no problem, thanks [10:15:47] (03PS2) 10Muehlenhoff: wmcs::services: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840143 (https://phabricator.wikimedia.org/T308013) [10:17:15] worked [10:18:29] Hey, can anyone please review this patch with +2 https://gerrit.wikimedia.org/r/840304 [10:19:01] wiki's up, syncing [10:19:14] (03CR) 10Muehlenhoff: [C: 03+2] wmcs::services: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840143 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:19:17] !log urbanecm@deploy1002 Started scap: Creating bnwikiquote (T319183) [10:19:22] T319183: Create Wikiquote Bengali - https://phabricator.wikimedia.org/T319183 [10:19:43] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10Aklapper) Please follow https://phabricator.wikimedia.org/tag/ldap-access-requests/ and don't add unrelated project tags presumably from some ancient tasks. Thanks a lot! :) [10:20:17] MdsShakil: you should ask in another channel this is for operations stuff, that being said, let me double check this with a bengali friend and get it merged [10:21:18] (03CR) 10Volans: "Sorry if I'm missing previous context, left few comments inline, nothing major." [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [10:22:53] (03PS1) 10Urbanecm: Initial configuration for bclwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841088 (https://phabricator.wikimedia.org/T316453) [10:24:00] (03CR) 10Jelto: [C: 04-1] "I have some concerns because there is no transparency about what images are in security-products. 
See more details here: T312961#8304248" [puppet] - 10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: 10SBassett) [10:24:05] (03PS2) 10Urbanecm: Initial configuration for tlwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841087 (https://phabricator.wikimedia.org/T317107) [10:24:14] !log urbanecm@deploy1002 Finished scap: Creating bnwikiquote (T319183) (duration: 04m 56s) [10:24:28] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for tlwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841087 (https://phabricator.wikimedia.org/T317107) (owner: 10Urbanecm) [10:24:33] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:24:34] (03PS2) 10Urbanecm: Initial configuration for bclwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841088 (https://phabricator.wikimedia.org/T316453) [10:25:15] (03Merged) 10jenkins-bot: Initial configuration for tlwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841087 (https://phabricator.wikimedia.org/T317107) (owner: 10Urbanecm) [10:25:47] (03PS1) 10David Caro: wmcs.create_instance_with_prefix: Fix sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 [10:26:01] (03CR) 10David Caro: [C: 03+2] wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [10:26:37] 10SRE, 10GitLab, 10Infrastructure-Foundations, 10CAS-SSO: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) [10:26:45] (03PS2) 10David Caro: wmcs.create_instance_with_prefix: Fix sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 [10:26:51] 10SRE, 10GitLab, 10Infrastructure-Foundations, 10CAS-SSO: migrate 
gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) p:05Triage→03Medium [10:27:06] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [10:27:34] (03CR) 10David Caro: wmcs.create_instance_with_prefix: Fix sec group default (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro) [10:28:17] !log urbanecm@deploy1002 Started scap: Creating tlwikiquote (T317107) [10:28:21] T317107: Create Wikiquote Tagalog - https://phabricator.wikimedia.org/T317107 [10:28:31] tlwikiquote works, syncing [10:28:55] wohoo [10:29:17] (03Merged) 10jenkins-bot: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [10:30:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:32:21] !log urbanecm@deploy1002 Finished scap: Creating tlwikiquote (T317107) (duration: 04m 04s) [10:32:37] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for bclwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841088 (https://phabricator.wikimedia.org/T316453) (owner: 10Urbanecm) [10:33:24] (03Merged) 10jenkins-bot: Initial configuration for bclwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841088 (https://phabricator.wikimedia.org/T316453) (owner: 10Urbanecm) [10:33:36] (03PS1) 10Urbanecm: Initial configuration for igwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841090 (https://phabricator.wikimedia.org/T314636) [10:34:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:34:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:35:05] bclwikiquote works, syncing [10:35:18] !log urbanecm@deploy1002 scap failed: FileNotFoundError [Errno 2] Invalid/unavailable version dir: 
'/srv/mediawiki-staging/php-1.40-0-wmf.4' (duration: 00m 00s) [10:36:26] (03PS1) 10Urbanecm: bclwikiquote: Fix invalid wikiversions.json entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840569 (https://phabricator.wikimedia.org/T316453) [10:36:28] (03CR) 10Urbanecm: [C: 03+2] bclwikiquote: Fix invalid wikiversions.json entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840569 (https://phabricator.wikimedia.org/T316453) (owner: 10Urbanecm) [10:36:45] !log urbanecm@deploy1002 Started scap: Creating bclwikiquote (T316453) [10:36:49] T316453: Create Wikiquote Central Bikol - https://phabricator.wikimedia.org/T316453 [10:37:10] (03Merged) 10jenkins-bot: bclwikiquote: Fix invalid wikiversions.json entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840569 (https://phabricator.wikimedia.org/T316453) (owner: 10Urbanecm) [10:37:44] (03PS2) 10Urbanecm: Initial configuration for igwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841090 (https://phabricator.wikimedia.org/T314636) [10:37:48] (03PS4) 10David Caro: iwmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah) [10:38:30] (03CR) 10David Caro: [C: 03+1] "Just rebased and fixed the merge conflict." 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah) [10:38:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:39:40] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [10:40:24] (03Abandoned) 10Arturo Borrero Gonzalez: cloud: eqiad1: depool rabbitmq02 [puppet] - 10https://gerrit.wikimedia.org/r/840107 (https://phabricator.wikimedia.org/T320232) (owner: 10Arturo Borrero Gonzalez) [10:40:29] (03PS5) 10Majavah: iwmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 [10:40:53] (03CR) 10Majavah: [C: 03+1] iwmcs: k8s: Fix cluster-info parsing (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah) [10:40:56] !log urbanecm@deploy1002 Finished scap: Creating bclwikiquote (T316453) (duration: 04m 11s) [10:41:34] (03PS1) 10Urbanecm: Initial configuration for igwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841091 (https://phabricator.wikimedia.org/T314635) [10:41:39] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for igwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841090 (https://phabricator.wikimedia.org/T314636) (owner: 10Urbanecm) [10:42:28] (03Merged) 10jenkins-bot: Initial configuration for igwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841090 (https://phabricator.wikimedia.org/T314636) (owner: 10Urbanecm) [10:43:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:44:35] (03CR) 10David Caro: [C: 03+2] iwmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah) [10:44:50] !log urbanecm@deploy1002 Started scap: Creating igwikiquote (T314636) [10:44:54] T314636: Create Wikiquote Igbo - https://phabricator.wikimedia.org/T314636 [10:44:55] syncing igwikiquote now 
[10:45:43] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for igwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841091 (https://phabricator.wikimedia.org/T314635) (owner: 10Urbanecm) [10:46:32] (03Merged) 10jenkins-bot: Initial configuration for igwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841091 (https://phabricator.wikimedia.org/T314635) (owner: 10Urbanecm) [10:47:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:47:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:48:11] (03Merged) 10jenkins-bot: iwmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah) [10:49:14] !log urbanecm@deploy1002 Finished scap: Creating igwikiquote (T314636) (duration: 04m 24s) [10:51:02] igwiktionary works, syncing [10:51:19] !log urbanecm@deploy1002 Started scap: Creating igwiktionary (T314635) [10:51:23] T314635: Create Wiktionary Igbo - https://phabricator.wikimedia.org/T314635 [10:51:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1014.eqiad.wmnet [10:51:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:52:01] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/840273 (owner: 10Majavah) [10:54:46] (03CR) 10Clément Goubert: Allow SRE to send annotated and signed tags (031 comment) [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/836711 (owner: 10Hashar) [10:55:32] !log urbanecm@deploy1002 Finished scap: Creating igwiktionary (T314635) (duration: 04m 13s) [10:56:03] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840570 [10:56:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840570 (owner: 10Urbanecm) [10:56:57] (03Merged) 
10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840570 (owner: 10Urbanecm) [10:57:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:57:06] (03CR) 10David Caro: update link [puppet] - 10https://gerrit.wikimedia.org/r/832646 (https://phabricator.wikimedia.org/T317987) (owner: 10Vivian Rook) [10:57:08] (03Abandoned) 10Clément Goubert: Release upstream version 3.9.4 [debs/helm3] - 10https://gerrit.wikimedia.org/r/839554 (owner: 10Clément Goubert) [10:57:13] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:840570|Update interwiki cache]] [10:57:33] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:840570|Update interwiki cache]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [10:57:42] (03CR) 10David Caro: "looking, I think this alert affects all hosts, not only labstores/nfs servers" [puppet] - 10https://gerrit.wikimedia.org/r/832646 (https://phabricator.wikimedia.org/T317987) (owner: 10Vivian Rook) [10:58:32] (03CR) 10Volans: "one comment inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [11:00:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:00:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:00:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1014.eqiad.wmnet [11:01:05] urbanecm: if you're done, shall we do the thing we were planning to do? 
[11:01:27] yep yep, just waiting on the interwiki cache sync to finish [11:01:27] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:840570|Update interwiki cache]] (duration: 04m 13s) [11:01:33] which just happened, so, let's do it [11:01:34] awesome [11:01:55] (03CR) 10David Caro: "Yep, just as he comment in the task says: https://phabricator.wikimedia.org/T317987#8243807" [puppet] - 10https://gerrit.wikimedia.org/r/832646 (https://phabricator.wikimedia.org/T317987) (owner: 10Vivian Rook) [11:02:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:06:10] (03CR) 10Volans: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [11:07:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:08:14] (03PS1) 10Majavah: Revert "P:toolforge::prometheus: disable k8s label map" [puppet] - 10https://gerrit.wikimedia.org/r/841082 [11:08:23] (03PS2) 10Majavah: Revert "P:toolforge::prometheus: disable k8s label map" [puppet] - 10https://gerrit.wikimedia.org/r/841082 [11:10:13] (03CR) 10Jbond: [C: 03+1] pki: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840142 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:11:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:11:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:12:10] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [11:13:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:14:18] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Citoid [11:22:43] (PS1) Jbond: 6.6.1: update files to prepare for 6.6.1 release [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841112 [11:23:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1014.eqiad.wmnet to cluster eqiad and group B [11:24:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1014.eqiad.wmnet to cluster eqiad and group B [11:26:19] SRE, Traffic, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) [11:26:30] SRE, Traffic, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) p:Triage→Medium [11:29:42] (PS1) Jcrespo: dbbackups: Upgrade db2102 to 10.6 to test myloader fix [puppet] - https://gerrit.wikimedia.org/r/841116 (https://phabricator.wikimedia.org/T319383) [11:31:05] SRE, Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (hnowlan) [11:31:21] (CR) Jbond: [C: -1] Add type Wmflib::POSIX::Name (3 comments) [puppet] - https://gerrit.wikimedia.org/r/840215 (owner: Ahmon Dancy) [11:36:06] (CR) Jcrespo: [C: +2] dbbackups: Upgrade db2102 to 10.6 to test myloader fix [puppet] - https://gerrit.wikimedia.org/r/841116 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo) [11:36:07] SRE, Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (hnowlan) a:hnowlan [11:37:41] SRE, Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (hnowlan) [11:39:22] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff) [11:48:45] (CR) Muehlenhoff:
[C: 03+2] Make ganeti1031 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841035 (https://phabricator.wikimedia.org/T299459) (owner: 10Muehlenhoff) [11:51:19] (03PS3) 10Samtar: swift: Add deployment-prep_hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) [11:52:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cumin] - 10https://gerrit.wikimedia.org/r/840268 (owner: 10Volans) [12:02:14] (03PS4) 10Samtar: swift: Add deployment-prep_hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) [12:07:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [12:10:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (10ayounsi) LibreNMS have a 5min resolution, while Prometheus is much more fine grained. [12:11:14] (03PS2) 10Jbond: 6.6.1: update files to prepare for 6.6.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841112 [12:11:16] (03PS1) 10Jbond: cas: drop u2f support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841122 [12:12:20] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (10cmooney) The difference is just to do with sampling / how it's graphed. The Prometheus query there is using irate([5m]), which I... 
[12:12:50] (03PS3) 10Jbond: 6.6.1: update files to prepare for 6.6.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841112 [12:13:03] (03PS4) 10Jbond: 6.6.1: update files to prepare for 6.6.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841112 [12:14:41] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Update CAS to 6.6 - https://phabricator.wikimedia.org/T311235 (10MoritzMuehlenhoff) [12:14:45] !log installing ruby-rack security updates [12:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:57] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-34), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10TheresNoTime) [12:15:52] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-34), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10TheresNoTime) [12:16:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [12:18:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1031.eqiad.wmnet to cluster eqiad and group A [12:19:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1031.eqiad.wmnet to cluster eqiad and group A [12:21:00] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (10cmooney) > at least one of them is not accurate Neither of them is accurate. It's almost impossible to have an accurate represe... 
[12:22:21] (03PS1) 10Muehlenhoff: Remove ganeti role from ganeti4004 [puppet] - 10https://gerrit.wikimedia.org/r/841124 [12:25:16] (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [12:25:19] (03CR) 10Filippo Giunchedi: [C: 03+2] confd: export template status as Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: 10Filippo Giunchedi) [12:25:46] 10SRE, 10Znuny, 10serviceops, 10serviceops-collab, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10LSobanski) p:05Triage→03Medium [12:27:28] (03CR) 10Volans: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [12:27:43] (03CR) 10Jbond: [C: 03+1] logstash: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:28:27] !log installing jetty9 security updates [12:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:51] (03PS6) 10Filippo Giunchedi: confd: install and run confd_prometheus_metrics [puppet] - 10https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272) [12:29:43] (03CR) 10Filippo Giunchedi: [C: 03+2] confd: install and run confd_prometheus_metrics [puppet] - 10https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272) (owner: 10Filippo Giunchedi) [12:31:27] (03CR) 10Volans: [C: 03+2] Backends: add timeout to PuppetDB (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/840268 (owner: 10Volans) [12:33:00] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: 
CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [12:35:11] !log installing puma security updates [12:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:38] (03CR) 10Vivian Rook: "wiki updated" [puppet] - 10https://gerrit.wikimedia.org/r/832646 (https://phabricator.wikimedia.org/T317987) (owner: 10Vivian Rook) [12:35:46] (03Abandoned) 10Vivian Rook: update link [puppet] - 10https://gerrit.wikimedia.org/r/832646 (https://phabricator.wikimedia.org/T317987) (owner: 10Vivian Rook) [12:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:38:59] (03CR) 10Volans: [C: 03+2] docs: Remove broken badges [software/cumin] - 10https://gerrit.wikimedia.org/r/840256 (owner: 10Krinkle) [12:40:46] (03Merged) 10jenkins-bot: Backends: add timeout to PuppetDB [software/cumin] - 10https://gerrit.wikimedia.org/r/840268 (owner: 10Volans) [12:43:28] PROBLEM - IPMI Sensor Status on clouddumps1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Inlet Temp = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:45:15] (03PS3) 10Slavina Stefanova: Add SSH key for sstefanova to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/826219 
(https://phabricator.wikimedia.org/T313934) [12:45:23] (Merged) jenkins-bot: docs: Remove broken badges [software/cumin] - https://gerrit.wikimedia.org/r/840256 (owner: Krinkle) [12:45:46] (CR) Slavina Stefanova: Add SSH key for sstefanova to authorized keys (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) (owner: Slavina Stefanova) [12:49:14] SRE, Traffic, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) Checking the client implementation for `go.etcd.io/etcd/client/v2 v2.305.4` it looks like the SRV discoverer share code with v3: https://github.com/etcd-io/etcd/blob... [12:51:21] (CR) Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: Muehlenhoff) [12:51:26] (PS12) Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [12:52:34] SRE, Traffic, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Joe) The correct domain to test for read-only clients is `conftool.eqiad.wmnet`, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/...
[12:54:49] (CR) CI reject: [V: -1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: Muehlenhoff) [12:55:29] SRE, serviceops-radar, Patch-For-Review, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi) [12:57:26] SRE, Traffic, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) you're right in that regard: ` vgutierrez@lvs6001:~$ ./l4lb etcd --domain conftool.eqiad.wmnet 2022/10/10 12:55:44 dns lookup errors: lookup _etcd-client-ssl._tcp.co... [12:59:26] SRE, Traffic, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Joe) yeah this changed with v3. The problem is that AIUI confd uses an older version of the library and expects the simpler form we have now. We can either add a new set of rec... [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221010T1300) [13:00:05] koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:09] (03PS5) 10Samtar: swift: Add deployment-prep_hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) [13:01:35] o/ [13:03:17] I can't deploy, sorry [13:05:53] I can deploy if needed :) will give urbanecm a moment though [13:06:13] (03PS2) 10Jelto: gitlab_runner: add option to drop Docker capabilities [puppet] - 10https://gerrit.wikimedia.org/r/773746 (https://phabricator.wikimedia.org/T295481) [13:06:38] (03CR) 10CI reject: [V: 04-1] gitlab_runner: add option to drop Docker capabilities [puppet] - 10https://gerrit.wikimedia.org/r/773746 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:09:11] koi: ref https://phabricator.wikimedia.org/T319537#8301427, you need that run first, yes? [13:09:24] TheresNoTime: yes [13:09:51] Okay :) [13:10:06] urbanecm: FYI I'm going to deploy :) [13:11:04] !log [samtar@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php trwikivoyage wikilove [13:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:24] (03PS2) 10Samtar: trwikivoyage: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840235 (https://phabricator.wikimedia.org/T319537) (owner: 10Stang) [13:12:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840235 (https://phabricator.wikimedia.org/T319537) (owner: 10Stang) [13:13:23] (03Merged) 10jenkins-bot: trwikivoyage: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840235 (https://phabricator.wikimedia.org/T319537) (owner: 10Stang) [13:13:36] !log samtar@deploy1002 Started scap: Backport for [[gerrit:840235|trwikivoyage: Install WikiLove extension (T319537)]] [13:13:41] T319537: Deployment WikiLove extension (trwikivoyage) - https://phabricator.wikimedia.org/T319537 [13:13:55] !log samtar@deploy1002 samtar and stang: Backport for 
[[gerrit:840235|trwikivoyage: Install WikiLove extension (T319537)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:13:58] koi: please test ^ [13:14:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:14:19] looking [13:14:40] RECOVERY - IPMI Sensor Status on clouddumps1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:16:16] TheresNoTime: it works correctly [13:16:34] koi: syncing [13:19:15] koi: was the `blockDisabledAccounts` maintenance script run also your addition? [13:19:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:19:33] yeah [13:19:41] (03CR) 10FNegri: [C: 03+1] Add SSH key for sstefanova to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) (owner: 10Slavina Stefanova) [13:19:49] (03PS1) 10Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #2 [puppet] - 10https://gerrit.wikimedia.org/r/841126 [13:19:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:20:09] but I'm not familiar with this script, so hope someone more familiar with it could run it [13:20:31] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:840235|trwikivoyage: Install WikiLove extension (T319537)]] (duration: 06m 55s) [13:20:36] T319537: Deployment WikiLove extension (trwikivoyage) - 
https://phabricator.wikimedia.org/T319537 [13:20:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:20:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:20:59] koi: okay, I have never run it — what's the risks? [13:21:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T314041)', diff saved to https://phabricator.wikimedia.org/P35386 and previous config saved to /var/cache/conftool/dbconfig/20221010-132115-ladsgroup.json [13:21:19] don't know 0_o [13:21:20] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:21:44] koi: hm, in that case, I'm going to say I don't personally feel comfortable running that [13:21:48] !log draining ganeti1006 T320419 [13:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:52] T320419: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 [13:22:25] * TheresNoTime is reading T106068 quickly.. [13:22:34] er, T106068 [13:22:35] T106068: [DisableAccount] Remove "inactive" user group - https://phabricator.wikimedia.org/T106068 [13:23:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:28] well IMO it would be hard to know if it works as expected, as I don't have access to most of the private wikis...? [13:24:45] (03PS13) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [13:25:07] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37491/console" [puppet] - 10https://gerrit.wikimedia.org/r/841126 (owner: 10Vgutierrez) [13:25:16] koi: ack, good point.. 
[13:26:24] any suggestions from the channel (cc urbanecm) ref running `blockDisabledAccounts.php` (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221010T1300) from T106068 [13:27:14] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Partition cache in one server per DC and cluster #2 [puppet] - 10https://gerrit.wikimedia.org/r/841126 (owner: 10Vgutierrez) [13:28:15] (03CR) 10CI reject: [V: 04-1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [13:29:10] koi: yeah apologies, I'm going to say no for a run of that — would suggest rescheduling and seeing if someone more familiar with the script (afaics it's not normally run in production?) can be around. Sorry! :) [13:29:44] TheresNoTime: fair enough, I would ask for someone else [13:30:26] I’m here now [13:30:34] not familiar with the script but I’ll take a look [13:30:39] ah thanks Lucas_WMDE :) [13:31:02] I'd rather be safe than sorry :D [13:31:42] !log partitioning the ATS cache in cp1087, cp1088, cp2039, cp2040, cp3062, cp3063, cp4033, cp4035, cp5013, cp5015, cp6006, cp6014 - T317748 [13:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:47] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 [13:32:30] hrm, doesn’t look like the script has a dry-run mode [13:32:54] yeah, I did have a look to see if it did :( [13:33:39] (03PS1) 10Muehlenhoff: Make ganeti1032 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841127 (https://phabricator.wikimedia.org/T299459) [13:34:13] (03CR) 10CI reject: [V: 04-1] Make ganeti1032 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841127 (https://phabricator.wikimedia.org/T299459) (owner: 10Muehlenhoff) [13:36:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2121', diff saved to https://phabricator.wikimedia.org/P35387 and previous config saved to /var/cache/conftool/dbconfig/20221010-133621-ladsgroup.json [13:36:35] 10SRE, 10Traffic, 10serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (10Volans) >>! In T320397#8304869, @Joe wrote: > The correct domain to test for read-only clients is `conftool.eqiad.wmnet`, see https://gerrit.wikimedia.org/r/plugins/gitiles/oper... [13:37:05] (03PS2) 10Muehlenhoff: Make ganeti1032 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841127 (https://phabricator.wikimedia.org/T299459) [13:39:02] looks like in fishbowl.dblist, only one wiki has any such accounts (donatewiki, 23 of them) [13:39:45] (03PS14) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [13:39:45] some more wikis in private.dblist have it, no more than 100 accounts in any one wiki though [13:40:21] (oh, and that’s only checking for the user group, I didn’t notice the “nulled” part of the maintenance script that might potentially further reduce the numbers) [13:40:38] (no wait, it would *increase* the numbers, those are added) [13:41:42] ok, each wiki has a handful of “nulled” accounts, but none more than 100 again [13:41:59] koi: should those accounts (password and email = '') also be blocked? 
[13:42:17] (asking since the maintenance script was copied from somewhere, so maybe we’re not actually interested in that part) [13:43:16] Lucas_WMDE: I thought we only need to consider whether a user is in the "inactive" group, as the aim is to remove this group [13:43:26] PROBLEM - puppet last run on bast1003 is CRITICAL: CRITICAL: Puppet has been disabled for 605094 seconds, message: test ssh - jbond, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:44:06] hm [13:45:19] I wonder if those users are already blocked [13:46:22] well, I found at least one is not blocked (on a private wiki) [13:47:19] an inactive or “nulled” user? [13:47:28] (03CR) 10David Caro: Revert "P:toolforge::prometheus: disable k8s label map" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841082 (owner: 10Majavah) [13:47:33] what do you mean "nulled" [13:47:59] that’s what the maintenance script seems to call users with password and email set to the empty string [13:48:03] (idk if there’s a better term for it) [13:48:17] I ran another query, and unless I did it wrong, at least some of those users aren’t blocked [13:48:29] if we don’t want to block them, the maintenance script should probably be updated [13:49:47] (03PS3) 10Majavah: Revert "P:toolforge::prometheus: disable k8s label map" [puppet] - 10https://gerrit.wikimedia.org/r/841082 [13:50:24] (03CR) 10Majavah: Revert "P:toolforge::prometheus: disable k8s label map" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841082 (owner: 10Majavah) [13:50:27] iirc if a user is nulled, they cannot log in unless those "nulled" fields are manually set (via a maintenance script), so at this point I thought it's not needed to block them? [13:50:51] ok, but the script, as it currently stands, would block them [13:50:52] (03CR) 10David Caro: [C: 03+2] "Thanks! 
(waiting for the tests before merging)" [puppet] - 10https://gerrit.wikimedia.org/r/841082 (owner: 10Majavah) [13:51:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P35388 and previous config saved to /var/cache/conftool/dbconfig/20221010-135128-ladsgroup.json [13:52:25] (03PS1) 10Matthias Mullie: Explicitly set wgPageImagesNamespaces to none where disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841133 (https://phabricator.wikimedia.org/T306883) [13:53:21] Lucas_WMDE: got it, it's the `array_merge` part at L42. I thought it would be great to confirm if it's necessary to also block those "nulled" users [13:53:32] honestly, I’d also like to see some more explanation in the Phabricator task of why we actually want to do this [13:53:44] so far, the justification seems to be that Bugreporter thinks it’s a good idea [13:54:08] do other people agree? I don’t really see that from the discussion so far [13:55:15] koi: who should confirm that? the deployer or the person requesting the maintenance script run? ^^ [13:55:49] I'll do that [13:55:54] ok, thanks [13:55:56] I would postpone the run of that script (and possibly modify some part of it), and ask for suggestions on the Phabricator task [13:55:58] then I guess this is postponed for now [13:56:15] 👍 [13:56:16] yeah, and thanks for your investigation today [13:56:49] (if the script is already being touch: also turn `$block->mExpiry = 'infinity';` into a setExpiry() call to avoid a deprecation warning) [13:56:53] *touched [13:57:24] TheresNoTime: anything else to deploy? 
[13:57:34] Lucas_WMDE: nope, that's all :) [13:57:43] !log UTC afternoon backport+config window done [13:57:54] 10SRE, 10Traffic, 10serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (10Vgutierrez) hmm from the mentioned documentation on the task description: ` If etcd is using TLS, the discovery SRV record (e.g. example.com) must be included in the SSL certifi... [13:58:08] what’s up with stashbot [13:58:24] last prod SAL was at :36, it already missed at least one log by Amir1 above [13:58:26] oop [13:58:51] wmopbot as well, toolforge issue? [13:59:46] (03PS9) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [13:59:50] (03PS1) 10Muehlenhoff: Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/841134 (https://phabricator.wikimedia.org/T319067) [13:59:56] TheresNoTime: looking [14:00:06] (03PS10) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [14:00:17] stashbot kubectl log has some twitter errors [14:00:19] TheresNoTime: webservices are replying ok, what is the issue you see? 
[14:00:24] but I seem to faintly remember those are normal [14:00:25] (03PS1) 10Clément Goubert: datahub: Change elasticsearch host/port defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/841135 (https://phabricator.wikimedia.org/T317511) [14:00:27] (03CR) 10Jbond: [C: 04-1] "change lgtm but unless im missing something we still need to call self.recheck_failed_services() regardless of skip_acked" [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [14:00:35] (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [14:00:38] dcaro: !log messages aren’t being added to the SAL [14:00:53] okok [14:01:57] !log 13:51 ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P35388 and previous config saved to /var/cache/conftool/dbconfig/20221010-135128-ladsgroup.json # re-logging due to stashbot issue [14:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:00] it seems to be working again [14:02:08] !log UTC afternoon backport+config window done # likewise [14:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:13] * dcaro did not do anything xd [14:02:18] looks like only two logs were missed according to the IRC log, at least in this channel [14:02:23] huh, I didn’t do anything either [14:02:44] (03CR) 10Clément Goubert: [C: 04-1] "After discussion with joe, this dates back to HHVM and using fcgi over tcp. This can probably be removed completely since we now use unix " [puppet] - 10https://gerrit.wikimedia.org/r/831629 (https://phabricator.wikimedia.org/T317454) (owner: 10Dzahn) [14:02:54] looks like stashbot timed out and then reconnected(?) 
on its own [14:06:26] `2022-10-10T14:01:49Z Stashbot ERROR : LDAP server connection barfed; retrying` seems ldap related? (I'm not very familiar with the bot) [14:06:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T314041)', diff saved to https://phabricator.wikimedia.org/P35389 and previous config saved to /var/cache/conftool/dbconfig/20221010-140635-ladsgroup.json [14:06:40] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:07:31] dcaro: looks like all the other wmcs irc bots are flapping too, so not a stashbot specific issue? [14:07:55] probably not, but maybe LDAP made toolforge things shiver [14:08:10] stashbot, ircservserv, jouncebot, wmopbot all have timed out in last few minutes [14:08:10] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [14:08:40] and there goes wikibugs [14:08:48] oh... tools control servers are still trying to mount labstore1006/7 [14:08:52] (probably unrelated though) [14:10:56] I thought we'd already cleaned those up everywhere after the paws issues? 
[14:11:47] (03PS1) 10Elukey: admin_ng: set x-forwarded-proto for ml-serve TLS egress orig settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/841137 (https://phabricator.wikimedia.org/T320374) [14:11:50] I only manually cleaned up paws, so maybe not, I'll add a task so I don't forget [14:12:32] (03PS1) 10Giuseppe Lavagetto: etcd: add records compatible with the v3 etcd library [dns] - 10https://gerrit.wikimedia.org/r/841138 (https://phabricator.wikimedia.org/T320397) [14:12:59] (03CR) 10CI reject: [V: 04-1] admin_ng: set x-forwarded-proto for ml-serve TLS egress orig settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/841137 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [14:14:29] (03PS1) 10Hoo man: maintenance::wikidata: Update cron with lb and lb-pool [puppet] - 10https://gerrit.wikimedia.org/r/841148 [14:14:56] (03CR) 10CI reject: [V: 04-1] maintenance::wikidata: Update cron with lb and lb-pool [puppet] - 10https://gerrit.wikimedia.org/r/841148 (owner: 10Hoo man) [14:15:31] (03CR) 10Hoo man: "Should be abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/841148" [puppet] - 10https://gerrit.wikimedia.org/r/553097 (https://phabricator.wikimedia.org/T238751) (owner: 10Alaa Sarhan) [14:16:49] very weird, I got a -1 for the datahub chart [14:18:34] (03PS2) 10Giuseppe Lavagetto: Stop assigning the PHP_ENGINE cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839499 (https://phabricator.wikimedia.org/T271736) [14:18:51] elukey: check https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/841135 [14:19:07] elukey: this is related to helm update to 3.9.4 [14:19:16] claime: <3 [14:20:08] elukey: We just found out after updating the CI image to use 3.9.4, sorry about that [14:20:34] claime: does the above change need a bump in the Chart.yaml's version? [14:20:43] Ah maybe, idk. jayme ? 
[14:21:00] yeah I think it is needed [14:21:07] I don't know if default values change need bumping [14:21:29] they are part of the chart so IIRC yes [14:21:42] ok, adding version bump [14:22:39] (03PS2) 10Clément Goubert: datahub: Change elasticsearch host/port defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/841135 (https://phabricator.wikimedia.org/T317511) [14:22:45] On one hand, yes, but not strictly. CI does not care about that and a deployment for that chart needs to define the elasticsearch values anyway [14:23:07] So not needed, but recommended? [14:23:30] yeah. I think it's polite to bump :) [14:23:38] Bump's done anyways :p [14:23:45] eheh [14:24:04] (03CR) 10JMeybohm: [C: 03+1] datahub: Change elasticsearch host/port defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/841135 (https://phabricator.wikimedia.org/T317511) (owner: 10Clément Goubert) [14:24:12] (03CR) 10Elukey: [C: 03+1] datahub: Change elasticsearch host/port defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/841135 (https://phabricator.wikimedia.org/T317511) (owner: 10Clément Goubert) [14:25:09] jayme: mmm CI doesn't care due to the fact that it is stateless? (about helm releases) Or something else? 
[14:25:36] anyway, thanks claime :) [14:25:50] 10SRE, 10Infrastructure-Foundations, 10netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (10ayounsi) For context: {T170369} [14:25:53] due to the fact that it runs on what's in git directly rather than the released artifacts in the chart repo [14:25:54] elukey: np, I break it I (try) to fix it ;) [14:26:14] jayme: ack thanks [14:26:35] claime: if you upgraded helm and this is the only issue then you did good work :) [14:26:59] Oh it isn't upgraded on the actual servers *yet*, only in CI [14:27:09] I expect surprises [14:27:24] nah [14:27:40] you work with appservers, this is nothing [14:27:42] :D [14:27:54] (03PS1) 10Giuseppe Lavagetto: etcd: use the v3-style SRV record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841139 (https://phabricator.wikimedia.org/T320397) [14:27:54] Heh [14:28:16] (03PS2) 10Hoo man: maintenance::wikidata: Update cron with lb and lb-pool [puppet] - 10https://gerrit.wikimedia.org/r/841148 [14:28:25] (03CR) 10Clément Goubert: [C: 03+2] datahub: Change elasticsearch host/port defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/841135 (https://phabricator.wikimedia.org/T317511) (owner: 10Clément Goubert) [14:28:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Lucas_Werkmeister_WMDE) Thanks, I can sudo to `analytics-privatedata` now. I can’t directly `kinit`, though: `lang=shell-session,counterexample lucaswerkmeister-w... 
[14:30:04] (03PS3) 10Hoo man: maintenance::wikidata: Update cron with lb and lb-pool [puppet] - 10https://gerrit.wikimedia.org/r/841148 [14:31:31] (03CR) 10Vgutierrez: "it looks like you've missed ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/841138 (https://phabricator.wikimedia.org/T320397) (owner: 10Giuseppe Lavagetto) [14:32:13] (03Merged) 10jenkins-bot: datahub: Change elasticsearch host/port defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/841135 (https://phabricator.wikimedia.org/T317511) (owner: 10Clément Goubert) [14:32:25] elukey: you should be good, possibly need a rebase [14:33:04] yep thanks! [14:33:13] yw [14:33:16] (03PS2) 10Elukey: admin_ng: set x-forwarded-proto for ml-serve TLS egress orig settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/841137 (https://phabricator.wikimedia.org/T320374) [14:33:37] (03PS1) 10Hoo man: profile::lvs::configuration: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/841142 [14:36:33] (03CR) 10Jbond: [C: 03+2] "lgtm thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/841142 (owner: 10Hoo man) [14:37:10] 10SRE, 10Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10MoritzMuehlenhoff) [14:38:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Ottomata) I think this was just overlooked. I just created your kerberos principal. You should have an email with instructions. 
[14:42:08] (03CR) 10Volans: "LGTM, question inline for the non-interactive mode of gnt commands" [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [14:42:39] (03CR) 10Elukey: [C: 03+2] admin_ng: set x-forwarded-proto for ml-serve TLS egress orig settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/841137 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [14:42:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Lucas_Werkmeister_WMDE) 05In progress→03Resolved Thanks, I think that worked! [14:47:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:47:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:47:58] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:48:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:48:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:48:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:52:13] 10SRE, 10Infrastructure-Foundations: Design and implement async LDAP operations - https://phabricator.wikimedia.org/T320427 (10MoritzMuehlenhoff) [14:54:08] 10SRE, 10Infrastructure-Foundations: Initial IDM puppetisation - https://phabricator.wikimedia.org/T320428 (10MoritzMuehlenhoff) [14:55:47] 10SRE, 10Infrastructure-Foundations: Bug in bridge-utils breaks IPv6 on interface if its not part of a bridge but vlan sub-int of it is - https://phabricator.wikimedia.org/T320429 (10cmooney) p:05Triage→03Low [14:56:00] 10SRE, 10Infrastructure-Foundations: Create an initial IDM/LDAP image for tests and CI - https://phabricator.wikimedia.org/T320430 (10MoritzMuehlenhoff) [14:56:15] !log Updating helm3 to 3.9.4-1 on chartmuseum2001.codfw.wmnet,chartmuseum1001.eqiad.wmnet,contint[1001,2001].wikimedia.org,deploy2002.codfw.wmnet,deploy1002.eqiad.wmnet,releases2002.codfw.wmnet,releases1002.eqiad.wmnet [14:56:16] (03CR) 10Volans: "reply to john" [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [14:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:13] 10SRE, 10Infrastructure-Foundations: IDM: Central logging on all changes - https://phabricator.wikimedia.org/T320431 (10MoritzMuehlenhoff) [15:01:23] (03CR) 10Hnowlan: Update the logic to run code coverage (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik) [15:01:38] (03PS1) 10Arturo Borrero Gonzalez: cloudnet: merge host hiera overrides back into the profile [puppet] - 10https://gerrit.wikimedia.org/r/841145 [15:04:20] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: Test disk type change [15:04:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: Test disk type 
change [15:09:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubestagetcd1004.eqiad.wmnet with reason: Test disk type change [15:09:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubestagetcd1004.eqiad.wmnet with reason: Test disk type change [15:11:54] !log installing fribidi security updates [15:14:57] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [15:16:46] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:18:59] 10SRE, 10Infrastructure-Foundations, 10netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (10cmooney) To clarify, I guess the question I was interested to know if people had opinions on was whether it would be a bad ide... [15:22:15] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@d1e6a2d]: (no justification provided) [15:22:28] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@d1e6a2d]: (no justification provided) (duration: 00m 13s) [15:23:34] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:29:28] (03CR) 10Dduvall: "I sure made a mess with this one, didn't I? I'll submit a follow-up to fix the uppercased environment variables." [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221010T1530). [15:32:00] (03CR) 10Btullis: [C: 03+1] "Late to the party, but thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/841135 (https://phabricator.wikimedia.org/T317511) (owner: 10Clément Goubert) [15:33:57] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@60aa96c]: (no justification provided) [15:34:10] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@60aa96c]: (no justification provided) (duration: 00m 12s) [15:38:41] (03PS1) 10Dduvall: P:gitlab::runner: Quote uppercase environment variable hash keys [puppet] - 10https://gerrit.wikimedia.org/r/841171 (https://phabricator.wikimedia.org/T317997) [15:39:15] (03PS2) 10Dduvall: P:gitlab::runner: Quote environment variable hash keys [puppet] - 10https://gerrit.wikimedia.org/r/841171 (https://phabricator.wikimedia.org/T317997) [15:39:30] 10SRE, 10Discovery-Search (Current work): Provide compatible elasticsearch-oss (7.x) and wmf-elasticsearch-search-plugins for buster on WMF APT repo - https://phabricator.wikimedia.org/T318820 (10MPhamWMF) [15:39:32] (03CR) 10Hnowlan: [C: 03+2] Add missing prod dependencies [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/839548 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:40:08] (03CR) 10Volans: "I'll suggest to setup as soon as possible CI on this repo, so that most nits can be picked automatically by the CI and don't require a lot" [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [15:41:47] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks for the reviews!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [15:49:14] (03PS6) 10Gehel: elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) (owner: 10Bking) [15:50:54] (03CR) 10Dduvall: P:gitlab::runner: Provide proxy variables to runner jobs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [15:52:54] (03Merged) 10jenkins-bot: Add missing prod dependencies [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/839548 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:04:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:05:13] (03CR) 10Dduvall: P:gitlab::runner: Provide proxy variables to runner jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [16:05:40] jouncebot: nowandnext [16:05:40] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [16:05:40] In 0 hour(s) and 54 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221010T1700) [16:06:27] * urbanecm sneaks a deployment in [16:06:40] (03PS2) 10Urbanecm: eswiki: Enable Growth mentorship for 25% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839485 (https://phabricator.wikimedia.org/T285235) [16:06:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839485 (https://phabricator.wikimedia.org/T285235) (owner: 10Urbanecm) [16:08:21] (03Merged) 10jenkins-bot: eswiki: Enable 
Growth mentorship for 25% of new accounts [mediawiki-config] - https://gerrit.wikimedia.org/r/839485 (https://phabricator.wikimedia.org/T285235) (owner: Urbanecm) [16:08:34] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:839485|eswiki: Enable Growth mentorship for 25% of new accounts (T285235)]] [16:08:39] T285235: Activate Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T285235 [16:08:54] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:839485|eswiki: Enable Growth mentorship for 25% of new accounts (T285235)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [16:09:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:11:09] SRE, serviceops: eqiad (2) memcached host for wikifunctions service implementation tracking - https://phabricator.wikimedia.org/T313965 (Joe) p:Triage→Medium [16:11:13] SRE, serviceops: eqiad (2) memcached host for wikifunctions service implementation tracking - https://phabricator.wikimedia.org/T313965 (Joe) a:Joe→None [16:13:26] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:839485|eswiki: Enable Growth mentorship for 25% of new accounts (T285235)]] (duration: 04m 46s) [16:15:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:16:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:16:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:17:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:18:19] can someone restart stashbot please?
[16:18:59] see https://wikitech.wikimedia.org/wiki/Tool:Stashbot#Maintenance for docs [16:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:45:32] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [16:47:34] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:51:39] (PS1) Jbond: casLoginView.html: Add original file from cas 6.6.1 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841180 [16:51:41] (PS1) Jbond: casLoginView.html: drop card properties [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841181 [16:53:22] (PS2) Jbond: casLoginView.html: Add original file from cas 6.6.1 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841180 [16:54:03] (CR) Jbond: "Im adding this first so we have history of our changes in git" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841180 (owner: Jbond) [16:54:12] (PS2) Jbond: casLoginView.html: drop card properties [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841181 [16:55:37] !log robh@cumin2002 START -
Cookbook sre.network.configure-switch-interfaces for host ganeti4008 [16:56:02] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti4008 [16:56:22] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:58:53] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:00:05] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221010T1700). [17:00:44] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4008.mgmt.ulsfo.wmnet with reboot policy FORCED [17:01:11] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ganeti4008.mgmt.ulsfo.wmnet with reboot policy FORCED [17:09:43] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4008.mgmt.ulsfo.wmnet with reboot policy FORCED [17:17:01] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4008.mgmt.ulsfo.wmnet with reboot policy FORCED [17:17:54] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4008.mgmt.ulsfo.wmnet with reboot policy FORCED [17:20:36] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4008.mgmt.ulsfo.wmnet with reboot policy FORCED [17:26:35] (PS1) RobH: ganeti4008 setup [puppet] - https://gerrit.wikimedia.org/r/841182 (https://phabricator.wikimedia.org/T317247) [17:27:37] (PS2) RobH: ganeti4008 setup [puppet] - https://gerrit.wikimedia.org/r/841182 (https://phabricator.wikimedia.org/T317247) [17:28:03] (CR) RobH: [C: +2] ganeti4008 setup [puppet] - https://gerrit.wikimedia.org/r/841182 (https://phabricator.wikimedia.org/T317247) (owner: RobH) [17:33:25] (CR) BCornwall: [C: +2] ats: Alert on high connection/request count [alerts] - https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815)
(owner: BCornwall) [17:33:31] (PS10) BCornwall: ats: Alert on high connection/request count [alerts] - https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) [17:35:03] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) [17:52:08] PROBLEM - IPMI Sensor Status on clouddumps1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Inlet Temp = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:53:28] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) [17:56:22] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4008.ulsfo.wmnet with OS bullseye [17:56:29] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4008.ulsfo.wmnet with OS bullseye [18:18:07] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage [18:21:44] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage [18:28:28] ops-ulsfo, Traffic, decommission-hardware: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (RobH) a:RobH [18:30:59] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns4002.wikimedia.org [18:34:31] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:34:33] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP
CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:35:09] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:35:47] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:37:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:38:33] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4008.ulsfo.wmnet with OS bullseye [18:38:40] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4008.ulsfo.wmnet with OS bullseye completed: - ganeti4... [18:39:23] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:39:24] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns4002.wikimedia.org [18:39:28] SRE, ops-ulsfo, Traffic, decommission-hardware: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `dns4002.wikimedia.org` - dns4002.wikimedia.org (**PASS**) - Downtimed host on Icinga/Aler...
[18:40:47] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:42:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:44:08] SRE, ops-ulsfo, Traffic, decommission-hardware: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (RobH) a:RobH→ssingh This is ready for full decom from puppet repo and resolution. [18:44:20] SRE, ops-ulsfo, Traffic, decommission-hardware: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (RobH) [18:44:37] SRE, ops-ulsfo, Traffic, decommission-hardware: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (RobH) [18:44:41] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) [18:50:11] (PS1) Vlad.shapik: WIP: Provide additional tests to cover errors and exceptions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [18:51:21] (PS2) Vlad.shapik: WIP: Provide additional tests to cover errors and exceptions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [18:51:43] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:53:29] RECOVERY - IPMI Sensor Status on clouddumps1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:54:15] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL -
AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:02:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:07:23] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns4004.mgmt.ulsfo.wmnet with reboot policy FORCED [19:13:19] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) [19:13:57] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns4004 [19:14:08] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns4004.mgmt.ulsfo.wmnet with reboot policy FORCED [19:14:13] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns4004 [19:14:40] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:16:45] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:18:19] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) [19:19:55] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:21:45] SRE, Traffic: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (BCornwall) In progress→Resolved [19:22:26] (PS1) RobH: dns4004 [puppet] - https://gerrit.wikimedia.org/r/841185 (https://phabricator.wikimedia.org/T317247) [19:22:47] (CR) RobH: [C: +2] dns4004 [puppet] - https://gerrit.wikimedia.org/r/841185
(https://phabricator.wikimedia.org/T317247) (owner: RobH) [19:23:02] (CR) CI reject: [V: -1] dns4004 [puppet] - https://gerrit.wikimedia.org/r/841185 (https://phabricator.wikimedia.org/T317247) (owner: RobH) [19:24:48] (PS2) RobH: dns4004 [puppet] - https://gerrit.wikimedia.org/r/841185 (https://phabricator.wikimedia.org/T317247) [19:24:58] (CR) RobH: [C: +2] dns4004 [puppet] - https://gerrit.wikimedia.org/r/841185 (https://phabricator.wikimedia.org/T317247) (owner: RobH) [19:41:51] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221010T2000) [20:00:05] Aishik: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] o/ [20:00:25] aishik: are you around? [20:02:13] (CR) Urbanecm: [C: -1] "svgs are not minimized" [mediawiki-config] - https://gerrit.wikimedia.org/r/841151 (https://phabricator.wikimedia.org/T319320) (owner: Aishik Rehman) [20:03:02] (CR) Urbanecm: [C: -1] Resize wordmark and tagline of Bengali Wikibooks Bug: T319320 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/841151 (https://phabricator.wikimedia.org/T319320) (owner: Aishik Rehman) [20:03:05] (CR) Urbanecm: [C: -1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/841151 (https://phabricator.wikimedia.org/T319320) (owner: Aishik Rehman) [20:08:22] hi Aishik! please see the code review i submitted for your patch a while ago, let me know if you have any questions.
[20:09:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:10:16] Which one requires to minimize? [20:10:24] wordmark or tagline? [20:10:43] Aishik: both :) [20:12:47] Hope its okay now........... [20:13:07] kindly recheck....... [20:13:42] Be aware of failing bots, wikibugs, stashbot, and wmopbot among others [20:14:54] Sariboo: that's currently tracked under T320446 (fyi, -cloud is generally a better channels for bots) [20:15:25] urbanecm: yes, but I was letting anyone that's here but not in -cloud know [20:17:05] Aishik: I'm sorry, the file still looks not minimized to me? [20:17:57] but I did (: [20:19:36] Aishik: what command did you run please? [20:20:52] Can you do it for me, please............ [20:20:56] ? [20:21:37] I did it! but idk why its still non-minimized! [20:23:36] Aishik: i'm trying to help (and teach you how you can do it next time), but i need to know how did you do the minimization :) [20:24:36] i think minimized means svg means minified svg [20:24:47] i think minimized svg means minified svg [20:25:40] Aishik: it does, but the svg file in the patch's not minified [20:25:40] So I just minified it [20:25:50] the docs are at https://www.mediawiki.org/wiki/Manual:Assets#SVG_files [20:25:54] Aishik: how did you minify it? [20:26:28] Just use a 3rd party site https://www.svgminify.com/ [20:26:37] *used [20:27:33] i see. we use the svgo tool (https://www.mediawiki.org/wiki/Manual:Assets#SVG_files) [20:27:53] i can run it for you this time, unless you want to try it out today, to ensure you can use it? [20:28:17] Aishik: ^ [20:29:46] Oh!  I was not familiar with this  tool! Can you share the method in details......... [20:31:52] Aishik: can you please ask specific questions? 
the details are at https://www.mediawiki.org/wiki/Manual:Assets#SVG_files :) [20:32:19] also, as i said, i can run svgo for you today (and deploy the patch), and you can learn it later for future patchs [20:32:29] let me know what you prefer Aishik [20:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:33:23] run it for me 😣 [20:33:39] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:58] Later I will learn how to use the umbrella 😌 [20:34:14] sounds good [20:35:14] let's deploy it [20:36:59] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:841151|Resize wordmark and tagline of Bengali Wikibooks (T319320)]] [20:37:18] !log urbanecm@deploy1002 urbanecm and aishik: Backport for [[gerrit:841151|Resize wordmark and tagline of Bengali Wikibooks (T319320)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:37:55] Aishik: the change's at mwdebug1001. can you test it there please? [20:38:15] wait a second, please [20:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:38:19] sure [20:40:00] I think it works! [20:40:18] great, syncing! 
[20:40:27] SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (ArielGlenn) >>! In T309346#8300585, @nskaggs wrote: > Given the new machines much larger capacity, I believe any pending req... [20:40:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:41:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:41:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:41:56] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS bullseye [20:42:03] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host dns4004.wikimedia.org with OS bullseye [20:44:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:44:28] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:841151|Resize wordmark and tagline of Bengali Wikibooks (T319320)]] (duration: 07m 29s) [20:44:32] Aishik: should be live!
[20:44:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:33] T319320: Add wordmark and tagline for Bengali Wikibooks - https://phabricator.wikimedia.org/T319320 [20:49:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:54:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:59:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:59:19] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221010T2100). 
[21:02:46] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [21:10:38] PROBLEM - IPMI Sensor Status on clouddumps1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Inlet Temp = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:19:09] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns4004.wikimedia.org with OS bullseye [21:19:20] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4004.wikimedia.org with OS bullseye completed: - dns4004... [21:19:55] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) [21:21:15] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) @Ssingh: dns4004 installed fine, so its ready for role and reimage as needed by #traffic. I also kicked the dns4002 decom task over to you for pu... [21:24:41] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (nskaggs) The new boxes are installed and storage should no longer be an issue. What is needed to proce...
[21:28:28] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:38:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:41:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:24:33] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (MichaelSchoenitzer) [22:43:30] RECOVERY - IPMI Sensor Status on clouddumps1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [23:12:12] (PS1) Zabe: nginx: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/841196 (https://phabricator.wikimedia.org/T308013) [23:13:30] (Abandoned) Zabe: nginx: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/841196 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [23:14:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:19:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:29:38] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown