[00:01:19] <mutante>	 sirenbot: do your thing
[00:34:10] <icinga-wm>	 PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:52] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd0d52d5280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[00:34:52] <icinga-wm>	 org/wiki/Search%23Administration
[00:35:44] <icinga-wm>	 RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:28] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 667, active_shards: 1509, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[00:36:28] <icinga-wm>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:46:20] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:09:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:37:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:46] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:20] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[01:50:52] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] hiera: move swift credentials into common [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[01:57:46] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:59:26] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] swift: move accounts_keys to common hiera global_account_keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[02:07:46] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[02:17:46] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:19:37] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Fix exception in `<gallery mode="slideshow">` with missing images [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/878154
[02:22:46] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:56] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] kafka-logging: add kafka-logging200[45] to codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron)
[02:46:32] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 199 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:48:08] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:57:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:47:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:52:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:55:17] <wikibugs>	 10SRE, 10SRE-OnFire, 10Product-Infrastructure-Team-Backlog, 10Maps (Kartotherian), 10Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (10lmata) 05Open→03Resolved a:03lmata   >>! In T266807#8495179, @akosiaris wrote: > Thi...
[04:13:30] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:16:36] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:21:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:28:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:37:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:37:54] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:47:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:48:26] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet
[05:48:28] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet
[05:52:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:54:31] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet
[05:55:07] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet
[06:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:33:17] <wikibugs>	 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) After collecting some correct data, and discussing the matter with @Krinkle ,  we don't think we have a strict need for onhost memcached at the moment if not for releivin...
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T0700)
[07:09:53] <wikibugs>	 (03CR) 10Ayounsi: P:environment: Add ablilty to inject environment variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond)
[07:16:23] <wikibugs>	 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) From JTAC: > This message “Read-only file system” suggest file system issues. I found one case with same behavior and the upgrade had to do it with...
[07:17:45] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) 05Open→03Resolved All good, thanks a lot!
[07:43:54] <icinga-wm>	 PROBLEM - SSH on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:51:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:56:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:57:20] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2010 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:36] <wikibugs>	 (03PS1) 10JMeybohm: cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292)
[07:58:38] <wikibugs>	 (03PS1) 10JMeybohm: cert-manager: Set leader election namespace to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/878752 (https://phabricator.wikimedia.org/T325292)
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T0800). Please do the needful.
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:02:06] <kart_>	 Looks like I didn't put my patch in the proper window :D
[08:02:49] <kart_>	 Oh, didn't put my name in the ircnick template :/
[08:03:24] <kart_>	 I'll go ahead with the only patch I have for backport.
[08:04:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877223 (https://phabricator.wikimedia.org/T326278) (owner: 10KartikMistry)
[08:06:20] <icinga-wm>	 PROBLEM - SSH on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:08:22] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2010 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:17] <wikibugs>	 (03CR) 10Ayounsi: O:rpkivalidator: add bgpalerter to rpki servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond)
[08:11:34] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2011 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:08] <wikibugs>	 (03CR) 10Muehlenhoff: admin: Add Jennifer Hancock to the datacenter-ops group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[08:13:46] <icinga-wm>	 RECOVERY - SSH on wdqs2010 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:14:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/878063 (https://phabricator.wikimedia.org/T326316) (owner: 10Jbond)
[08:15:14] <icinga-wm>	 PROBLEM - SSH on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:18:38] <icinga-wm>	 PROBLEM - SSH on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:19:23] <wikibugs>	 (03Merged) 10jenkins-bot: CX: Fix transformation of TranslationUnitDTO to custom array [extensions/ContentTranslation] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877223 (https://phabricator.wikimedia.org/T326278) (owner: 10KartikMistry)
[08:19:59] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:877223|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]]
[08:20:02] <stashbot>	 T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278
[08:21:48] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:877223|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:25:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/878014 (https://phabricator.wikimedia.org/T325387) (owner: 10Muehlenhoff)
[08:25:30] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:43] <wikibugs>	 (03CR) 10MVernon: swift: move accounts_keys to common hiera global_account_keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[08:28:52] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:34] <icinga-wm>	 RECOVERY - SSH on wdqs2010 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:30:26] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:31:44] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:877223|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]] (duration: 11m 45s)
[08:31:47] <stashbot>	 T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278
[08:32:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ferm to 2.5.1 - https://phabricator.wikimedia.org/T248954 (10MoritzMuehlenhoff) 05Open→03Declined We won't update Buster hosts to 2.5.1 anymore, these will only be around for some more months anyway and all energy is better spent on migrating these systems to Bu...
[08:32:28] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:23] <wikibugs>	 (03PS1) 10Ayounsi: depool eqsin for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/878854 (https://phabricator.wikimedia.org/T316532)
[08:34:24] <wikibugs>	 (03CR) 10MVernon: swift: move accounts_keys to common hiera global_account_keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[08:34:36] <icinga-wm>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:39] <wikibugs>	 (03PS2) 10Muehlenhoff: os-reports: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/870558
[08:34:43] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] depool eqsin for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/878854 (https://phabricator.wikimedia.org/T316532) (owner: 10Ayounsi)
[08:35:02] <kart_>	 No more patches in UTC morning backport window.
[08:36:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] os-reports: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/870558 (owner: 10Muehlenhoff)
[08:38:18] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:28] <icinga-wm>	 PROBLEM - SSH on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:42:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:43:26] <icinga-wm>	 RECOVERY - SSH on wdqs2012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:48:22] <icinga-wm>	 PROBLEM - SSH on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:48:22] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: trafficserver: remove support for "php7.4" option in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867541
[08:50:18] <icinga-wm>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[08:51:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: move swift credentials into common [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[08:52:12] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:20] <wikibugs>	 (03PS4) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[08:53:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] swift: move accounts_keys to common hiera global_account_keys [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[08:53:37] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Looks good to me, please deploy at any time." [puppet] - 10https://gerrit.wikimedia.org/r/878186 (owner: 10Eevans)
[08:58:05] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] wdqs: make depool the default behavior [cookbooks] - 10https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: 10Ryan Kemper)
[08:58:23] <wikibugs>	 (03PS2) 10Gehel: wdqs: make depool the default behavior [cookbooks] - 10https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: 10Ryan Kemper)
[08:59:16] <icinga-wm>	 RECOVERY - SSH on wdqs2012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:15:48] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:00] <wikibugs>	 (03PS7) 10Hashar: Display Zuul status of jobs for a change in Gerrit [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859127 (https://phabricator.wikimedia.org/T214068)
[09:17:58] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] httpbb: add SPDX license headers for some test files [puppet] - 10https://gerrit.wikimedia.org/r/878205 (owner: 10Dzahn)
[09:23:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:25:53] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: remove support for "php7.4" option in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867541 (owner: 10Giuseppe Lavagetto)
[09:26:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] cassandra: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/870847 (owner: 10Muehlenhoff)
[09:28:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:28:58] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:01] <wikibugs>	 (03PS1) 10Muehlenhoff: package_builder: Also install the hooks for sid [puppet] - 10https://gerrit.wikimedia.org/r/878856
[09:32:39] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/878856 (owner: 10Muehlenhoff)
[09:33:46] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:36:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] package_builder: Also install the hooks for sid [puppet] - 10https://gerrit.wikimedia.org/r/878856 (owner: 10Muehlenhoff)
[09:41:02] <jynus>	 gehel: I think this is something you may know about, but please correct me if it is the wrong team. WDQS has an outdated SPARQL check, should I file a ticket about that?
[09:49:28] <moritzm>	 !log installing python3.7 security updates
[09:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:32] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus: Fix job:haproxy_frontend_http_responses_total:rate2m [puppet] - 10https://gerrit.wikimedia.org/r/878858 (https://phabricator.wikimedia.org/T288196)
[09:55:12] <wikibugs>	 (03CR) 10Hashar: opensearch: make upgrade-phatality.sh stricter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: 10Hashar)
[09:57:12] <wikibugs>	 (03PS3) 10Hashar: opensearch: make upgrade-phatality.sh stricter [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440)
[09:57:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, by convention the metric should be job_code:etcetc (i.e. list the aggregation variables). Though in this case we have already the va" [puppet] - 10https://gerrit.wikimedia.org/r/878858 (https://phabricator.wikimedia.org/T288196) (owner: 10Vgutierrez)
[09:57:36] <wikibugs>	 (03PS1) 10Jelto: add Stephane Rebai to ldap/wmf group [puppet] - 10https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655)
[09:57:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/877276 (owner: 10Dzahn)
[09:58:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/877277 (owner: 10Dzahn)
[09:58:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] add Stephane Rebai to ldap/wmf group [puppet] - 10https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655) (owner: 10Jelto)
[09:58:36] <wikibugs>	 (03CR) 10Hashar: opensearch: make upgrade-phatality.sh stricter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: 10Hashar)
[09:59:42] <wikibugs>	 (03PS2) 10Jelto: add Stephane Rebai to ldap/wmf group [puppet] - 10https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655)
[10:01:06] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] prometheus: Fix job:haproxy_frontend_http_responses_total:rate2m [puppet] - 10https://gerrit.wikimedia.org/r/878858 (https://phabricator.wikimedia.org/T288196) (owner: 10Vgutierrez)
[10:02:26] <XioNoX>	 !log asw1-eqsin> request system reboot all-members - T316532
[10:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:29] <stashbot>	 T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532
[10:04:34] <wikibugs>	 (03PS24) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:05:00] <icinga-wm>	 PROBLEM - VRRP status on cr3-eqsin is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[10:05:35] <XioNoX>	 expected
[10:05:38] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 10 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:06:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39057/console" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond)
[10:06:32] <jinxer-wm>	 (virtual-chassis crash) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[10:07:29] <XioNoX>	 you can ignore that too
[10:07:38] <wikibugs>	 (PS3) Jelto: add Stephane Rebai to ldap/wmf group [puppet] - https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655)
[10:07:56] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:08:39] <XioNoX>	 says codfw, so I guess it's not related ^
[10:08:50] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Connect -
[10:08:50] <icinga-wm>	 , AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:08:59] <XioNoX>	 expected ^
[10:09:00] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:09:34] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655) (owner: Jelto)
[10:11:41] <wikibugs>	 (PS25) Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:12:44] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:05] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39058/console" [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: Jbond)
[10:13:18] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:13:55] <wikibugs>	 (CR) Jelto: [C: +2] add Stephane Rebai to ldap/wmf group [puppet] - https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655) (owner: Jelto)
[10:13:57] <wikibugs>	 (PS26) Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:13:59] <wikibugs>	 (PS1) Jbond: bgpalerter: add authorizationHeader and use yml vs yaml extension [puppet] - https://gerrit.wikimedia.org/r/878865
[10:14:08] <icinga-wm>	 RECOVERY - VRRP status on cr3-eqsin is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[10:15:02] <wikibugs>	 (PS1) Zabe: Simplify expensive check [extensions/3D] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/878160 (https://phabricator.wikimedia.org/T326690)
[10:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[10:15:20] <XioNoX>	 half of the switch stack came back online fine...
[10:15:26] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39059/console" [puppet] - https://gerrit.wikimedia.org/r/878865 (owner: Jbond)
[10:16:36] <moritzm>	 !log installing postgresql-11 security updates
[10:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:38] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:17:31] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (Jelto) p:Triage→Medium a:Jelto Thanks for opening the request.  I added you to: * ldap/wmf group * wmf-nda phabricator group * to [data.yaml](https://gerrit...
[10:17:32] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 5 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:18:25] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bullseye
[10:18:33] <wikibugs>	 (PS2) Zabe: Start reading from cuc_actor on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/877249 (https://phabricator.wikimedia.org/T233004)
[10:18:48] <wikibugs>	 (CR) Zabe: [C: +2] Start reading from cuc_actor on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/877249 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[10:18:52] <wikibugs>	 (CR) Zabe: [C: +2] Simplify expensive check [extensions/3D] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/878160 (https://phabricator.wikimedia.org/T326690) (owner: Zabe)
[10:19:11] <XioNoX>	 I'll have to follow up with jtac, something is busted on one of the two switches...
[10:19:38] <wikibugs>	 (Merged) jenkins-bot: Start reading from cuc_actor on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/877249 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[10:20:51] <wikibugs>	 (Merged) jenkins-bot: Simplify expensive check [extensions/3D] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/878160 (https://phabricator.wikimedia.org/T326690) (owner: Zabe)
[10:21:24] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:878160|Simplify expensive check (T326690)]], [[gerrit:877249|Start reading from cuc_actor on test wikis (T233004)]]
[10:21:29] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[10:21:29] <stashbot>	 T326690: PHP Deprecated: HTMLFormField::__construct: Constructing an HTMLFormField without a 'parent' parameter [Called from Licenses::__construct] - https://phabricator.wikimedia.org/T326690
[10:21:32] <jinxer-wm>	 (virtual-chassis crash) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[10:23:13] <logmsgbot>	 !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:878160|Simplify expensive check (T326690)]], [[gerrit:877249|Start reading from cuc_actor on test wikis (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[10:23:39] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid test cluster: Reboot Druid nodes
[10:24:37] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw1486.eqiad.wmnet with reason: hardware troubleshooting
[10:25:02] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw1486.eqiad.wmnet with reason: hardware troubleshooting
[10:25:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=edb03633-d9b6-4a06-849d-2c3da0e62688) set by cgoubert@cumin1001 for 7 days,...
[10:26:05] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (Jelto) According to https://www.mediawiki.org/wiki/Gerrit/Privilege_policy you should also have Gerrit +2 from your ldap/wmf membership.  So you should have the request...
[10:26:25] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (Jelto) a:Jelto→StephaneRebai
[10:26:43] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: multi-processing changes for staging [deployment-charts] - https://gerrit.wikimedia.org/r/878866
[10:27:14] <wikibugs>	 (PS2) Ilias Sarantopoulos: ml-services: multi-processing changes for staging [deployment-charts] - https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624)
[10:29:02] <wikibugs>	 (CR) Filippo Giunchedi: "Only nits/minor things really" [puppet] - https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: Jcrespo)
[10:29:04] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] bgpalerter: add authorizationHeader and use yml vs yaml extension [puppet] - https://gerrit.wikimedia.org/r/878865 (owner: Jbond)
[10:30:59] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:878160|Simplify expensive check (T326690)]], [[gerrit:877249|Start reading from cuc_actor on test wikis (T233004)]] (duration: 09m 34s)
[10:31:04] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[10:31:04] <stashbot>	 T326690: PHP Deprecated: HTMLFormField::__construct: Constructing an HTMLFormField without a 'parent' parameter [Called from Licenses::__construct] - https://phabricator.wikimedia.org/T326690
[10:32:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] service::catalog: Add aux-k8s-ingress [puppet] - https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: Clément Goubert)
[10:34:21] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage
[10:36:29] <wikibugs>	 (PS27) Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:36:31] <wikibugs>	 (PS1) Jbond: bgpalerter: add defaults for bgpalerter [puppet] - https://gerrit.wikimedia.org/r/878867
[10:36:45] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] bgpalerter: add defaults for bgpalerter [puppet] - https://gerrit.wikimedia.org/r/878867 (owner: Jbond)
[10:37:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage
[10:37:49] <bawolff>	 zabe: Thanks for backporting the follow up on the HTMLForm thing :)
[10:39:26] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] service::catalog: Add aux-k8s-ingress [puppet] - https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: Clément Goubert)
[10:39:28] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) fpc0 went back up fine, but fpc1 not so much... It's not fully booting and stuck at a busybox like shell. Root password works so that means the con...
[10:40:37] <wikibugs>	 (CR) Ayounsi: [C: +1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: Jbond)
[10:41:46] <zabe>	 yw
[10:43:40] <bawolff>	 huh, redlinks are broken on mw.org
[10:44:22] <bawolff>	 or maybe just in flow
[10:44:26] <wikibugs>	 (CR) Jcrespo: "Could you clarify comments 1 and 2, 3 I will fix right away." [puppet] - https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: Jcrespo)
[10:44:53] <wikibugs>	 (PS28) Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:44:55] <wikibugs>	 (PS1) Jbond: bgpalerter: add profile [puppet] - https://gerrit.wikimedia.org/r/878868
[10:45:28] <wikibugs>	 (CR) CI reject: [V: -1] bgpalerter: add profile [puppet] - https://gerrit.wikimedia.org/r/878868 (owner: Jbond)
[10:45:44] <wikibugs>	 (PS1) Volans: dhcp: fix tests using unnecessary hack [software/spicerack] - https://gerrit.wikimedia.org/r/878869
[10:45:58] <wikibugs>	 (CR) Volans: [C: -1] "That's actually kinda correct as a report, the there is an error in the tests. I've sent I8adace301ff730e5f311ea233266565946f0d9ae to fix " [software/spicerack] - https://gerrit.wikimedia.org/r/878172 (owner: Jbond)
[10:46:02] <wikibugs>	 (PS1) Zabe: Start reading from cul_actor everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/878870 (https://phabricator.wikimedia.org/T233004)
[10:47:21] <gehel>	 jynus: sorry for the delay. Yes, please create a phab task. How is it outdated?
[10:47:41] <jynus>	 I just dug deeper and the check is right, I was confused by it
[10:47:59] <wikibugs>	 SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (SLyngshede-WMF)
[10:48:04] <jynus>	 I wonder if it is T323096 and an expired downtime, gehel
[10:48:04] <stashbot>	 T323096: WDQS Data Reload - https://phabricator.wikimedia.org/T323096
[10:48:22] <jynus>	 in that case, just a new longer downtime should do the trick
[10:48:24] <wikibugs>	 SRE, Infrastructure-Foundations: Design and implement async LDAP operations - https://phabricator.wikimedia.org/T320427 (SLyngshede-WMF) In progress→Resolved We're going with Django-RQ as it's simpler and does not require Celery.
[10:48:39] <jynus>	 wdqs is returning 400 on those hosts, hence the error
[10:49:12] <gehel>	 Oh, might be that data reload. I'll have a look (cc inflatador, ryankemper)
[10:49:15] <jynus>	 I was about to ask on T323096
[10:49:32] <jynus>	 maybe this is not something you are in charge of
[10:49:44] <wikibugs>	 (PS2) Jbond: bgpalerter: add profile [puppet] - https://gerrit.wikimedia.org/r/878868
[10:50:05] <jinxer-wm>	 (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_aux-k8s-ingress.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:50:39] <wikibugs>	 (PS3) Jbond: bgpalerter: add profile [puppet] - https://gerrit.wikimedia.org/r/878868
[10:50:41] <wikibugs>	 (PS29) Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:52:19] <gehel>	 jynus: Ryan and Brian are working on that data reload, and it has not been going as planned :/ But I have some knowledge of what's going on.
[10:52:22] <wikibugs>	 (CR) Jbond: [C: +2] bgpalerter: add profile [puppet] - https://gerrit.wikimedia.org/r/878868 (owner: Jbond)
[10:52:29] <gehel>	 I'll extend the downtime
[10:52:44] <jynus>	 oh, I had just commented: https://phabricator.wikimedia.org/T323096#8516090
[10:53:05] <jynus>	 you can also ack, which will disable the alerts until they work again so they don't expire, up to you
[10:53:32] <jynus>	 feel free to comment there if you take action so they don't have to
[10:53:56] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: multi-processing changes for staging [deployment-charts] - https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[10:54:09] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:54:12] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1001.eqiad.wmnet with OS bullseye
[10:55:05] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_eqiad_aux-k8s-ingress.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:55:35] <gehel>	 jynus: It's related more to T301167. I've added a week of downtime (cc: inflatador, ryankemper)
[10:55:35] <stashbot>	 T301167: Service implementation for wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T301167
[10:56:00] <jynus>	 I see, thank you!
[10:56:07] <jynus>	 sorry for the ping
[10:56:49] <jynus>	 so initially I had thought that the string returned was outdated and the check needed changes
[10:57:23] <jynus>	 but it turned out it was a service returning a 400 code when I dug deeper
[10:58:36] <wikibugs>	 (PS10) Slyngshede: role:IDM assign IDM role to test VM. [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795)
[10:59:10] <wikibugs>	 (PS11) Slyngshede: role:IDM assign IDM role to test VM. [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1100)
[11:00:54] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] docker::baseimages: inject no_proxy config to rebuild job [puppet] - https://gerrit.wikimedia.org/r/878063 (https://phabricator.wikimedia.org/T326316) (owner: Jbond)
[11:00:56] <wikibugs>	 (PS1) Muehlenhoff: Limit the installed hooks for sid [puppet] - https://gerrit.wikimedia.org/r/878871
[11:04:15] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/878869 (owner: Volans)
[11:04:25] <wikibugs>	 (Abandoned) Jbond: dhcp: disable no-member check [software/spicerack] - https://gerrit.wikimedia.org/r/878172 (owner: Jbond)
[11:04:30] <wikibugs>	 (PS12) Slyngshede: role:IDM assign IDM role to test VM. [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795)
[11:05:24] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39061/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[11:06:09] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:08:18] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] aux_k8s::worker: Include P::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/878872 (https://phabricator.wikimedia.org/T325178) (owner: Clément Goubert)
[11:09:51] <wikibugs>	 (PS13) Slyngshede: role:IDM assign IDM role to test VM. [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795)
[11:12:41] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 80 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish
[11:12:41] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3127 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish
[11:12:41] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3126 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish
[11:12:41] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3125 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish
[11:12:41] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3124 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish
[11:12:45] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1003.eqiad.wmnet with OS bullseye
[11:13:25] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/878871 (owner: Muehlenhoff)
[11:14:03] <wikibugs>	 (PS12) Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[11:14:04] <icinga-wm>	 ACKNOWLEDGEMENT - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/DNS
[11:14:04] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:14:04] <icinga-wm>	 ACKNOWLEDGEMENT - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/NTP
[11:14:04] <icinga-wm>	 ACKNOWLEDGEMENT - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - query timed out ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/DNS
[11:14:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host dns5004 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T316532
[11:15:02] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.druid.reboot-workers (exit_code=99) for Druid test cluster: Reboot Druid nodes
[11:15:44] <wikibugs>	 (CR) Volans: [C: +2] dhcp: fix tests using unnecessary hack [software/spicerack] - https://gerrit.wikimedia.org/r/878869 (owner: Volans)
[11:15:59] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[11:15:59] <icinga-wm>	 ACKNOWLEDGEMENT - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T316532
[11:16:01] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99)
[11:16:07] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] conftool-data: Add aux-k8s-workers to conftool [puppet] - https://gerrit.wikimedia.org/r/878874 (https://phabricator.wikimedia.org/T325178) (owner: Clément Goubert)
[11:16:19] <icinga-wm>	 ACKNOWLEDGEMENT - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/DNS
[11:16:46] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] conftool-data: Add aux-k8s-workers to conftool [puppet] - https://gerrit.wikimedia.org/r/878874 (https://phabricator.wikimedia.org/T325178) (owner: Clément Goubert)
[11:17:07] <wikibugs>	 (CR) Clément Goubert: [C: +2] conftool-data: Add aux-k8s-workers to conftool [puppet] - https://gerrit.wikimedia.org/r/878874 (https://phabricator.wikimedia.org/T325178) (owner: Clément Goubert)
[11:18:44] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 5 ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:18:44] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Connect - Anycast ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:18:44] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 5 ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:18:44] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:18:50] <wikibugs>	 (PS1) Jbond: bgpalerter: move authorizationHeader to ris section [puppet] - https://gerrit.wikimedia.org/r/878875
[11:19:15] <wikibugs>	 (Merged) jenkins-bot: dhcp: fix tests using unnecessary hack [software/spicerack] - https://gerrit.wikimedia.org/r/878869 (owner: Volans)
[11:19:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast3006.wikimedia.org
[11:19:26] <wikibugs>	 (PS3) Volans: puppet: allow to specify the exact disabled message [software/spicerack] - https://gerrit.wikimedia.org/r/869773 (owner: Jbond)
[11:19:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:19:34] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39063/console" [puppet] - https://gerrit.wikimedia.org/r/878875 (owner: Jbond)
[11:19:59] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper virtual chassis ports on asw1-eqsin is CRITICAL: CRIT: Down: 2 Unknown: 0 ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[11:20:34] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/878871 (owner: Muehlenhoff)
[11:21:09] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] bgpalerter: move authorizationHeader to ris section [puppet] - https://gerrit.wikimedia.org/r/878875 (owner: Jbond)
[11:21:19] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Limit the installed hooks for sid [puppet] - https://gerrit.wikimedia.org/r/878871 (owner: Muehlenhoff)
[11:21:47] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1038.eqiad.wmnet with OS bullseye
[11:22:54] <wikibugs>	 (CR) CI reject: [V: -1] puppet: allow to specify the exact disabled message [software/spicerack] - https://gerrit.wikimedia.org/r/869773 (owner: Jbond)
[11:22:57] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[11:22:59] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99)
[11:23:08] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) We tried to boot on the Recovery Junos (both 14 and 20) but the same error happened.  Next step is onsite "format install" https://supportportal.ju...
[11:24:56] <wikibugs>	 (CR) Volans: "The approach LGTM, couple of nits inline" [software/spicerack] - https://gerrit.wikimedia.org/r/869773 (owner: Jbond)
[11:25:49] <icinga-wm>	 RECOVERY - Check systemd state on people2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:54] <wikibugs>	 (CR) Filippo Giunchedi: check_legal_terms: Refactor check to make it more robust against changes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: Jcrespo)
[11:28:12] <wikibugs>	 (PS1) Jbond: cache::base: move wikimedia and wmcs domains to global level [puppet] - https://gerrit.wikimedia.org/r/878876
[11:28:41] <icinga-wm>	 RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3006.wikimedia.org - jmm@cumin2002"
[11:29:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3006.wikimedia.org - jmm@cumin2002"
[11:29:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:29:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast3006.wikimedia.org on all recursors
[11:29:53] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39064/console" [puppet] - https://gerrit.wikimedia.org/r/878876 (owner: Jbond)
[11:30:13] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) bast3006.wikimedia.org on all recursors
[11:33:05] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1038.eqiad.wmnet with reason: host reimage
[11:36:08] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1038.eqiad.wmnet with reason: host reimage
[11:38:40] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes:weight=10; selector: cluster=aux-k8s,service=kubesvc
[11:39:23] <wikibugs>	 (CR) AikoChou: [C: +1] "Only one suggestion regarding the commit message :)" [deployment-charts] - https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[11:40:33] <wikibugs>	 (PS1) Jbond: bgpalerter: Pass through RPKI config [puppet] - https://gerrit.wikimedia.org/r/878877
[11:40:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add SPDX headers to various base/IF profiles [puppet] - https://gerrit.wikimedia.org/r/863305 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:40:57] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] cache::base: move wikimedia and wmcs domains to global level [puppet] - https://gerrit.wikimedia.org/r/878876 (owner: Jbond)
[11:41:17] <wikibugs>	 (PS14) Slyngshede: role:IDM assign IDM role to test VM. [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795)
[11:41:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Jclark-ctr) @clements_goubert I checked yesterday afternoon did not see any alerts. Let’s repool server close ticket
[11:41:35] <jbond>	 moritzm: FYI I'll merge your SPDX changes as well
[11:41:40] <moritzm>	 please do
[11:41:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Jclark-ctr) In progress→Resolved
[11:41:46] * jbond done
[11:41:53] <moritzm>	 thx
[11:41:56] <jbond>	 np
[11:42:21] <wikibugs>	 (CR) CI reject: [V: -1] bgpalerter: Pass through RPKI config [puppet] - https://gerrit.wikimedia.org/r/878877 (owner: Jbond)
[11:43:08] <wikibugs>	 (PS13) Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[11:43:47] <wikibugs>	 (PS2) Muehlenhoff: Add SPDX headers for various IF profiles [puppet] - https://gerrit.wikimedia.org/r/860912 (https://phabricator.wikimedia.org/T308013)
[11:44:43] <wikibugs>	 (CR) Jcrespo: check_legal_terms: Refactor check to make it more robust against changes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: Jcrespo)
[11:44:50] <wikibugs>	 (CR) Ayounsi: [C: +1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: Jbond)
[11:47:15] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1486.eqiad.wmnet
[11:48:23] <wikibugs>	 (03PS1) 10Muehlenhoff: package_builder::pbuilder_hook: Manage the hook directory with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/878879
[11:49:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast3006.wikimedia.org
[11:50:05] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1486 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[11:50:20] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1486.eqiad.wmnet
[11:50:20] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1486.eqiad.wmnet
[11:50:35] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage
[11:50:36] <jinxer-wm>	 (ConfdResourceFailed) resolved: (2) confd resource _srv_config-master_pybal_eqiad_aux-k8s-ingress.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[11:51:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede)
[11:51:23] <claime>	 !log repooled mw1486 in api_appserver eqiad after hardware investigation - T326425
[11:51:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:26] <stashbot>	 T326425: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425
[11:51:29] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede)
[11:51:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Clement_Goubert) Server repooled, thanks a bunch.
[11:52:33] <wikibugs>	 (03CR) 10Volans: "LGTM but I'll leave it to John for the review of the intricacies of recurse/purge" [puppet] - 10https://gerrit.wikimedia.org/r/878879 (owner: 10Muehlenhoff)
[11:53:26] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: ml-services: multi-processing changes for articlequality and drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624)
[11:53:39] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage
[11:54:45] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: multi-processing changes for articlequality and drafttopic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[11:55:35] <wikibugs>	 (03PS1) 10Effie Mouzeli: memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881
[11:55:59] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: multi-processing changes for articlequality and drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[11:57:21] <icinga-wm>	 PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:58:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp5018.eqsin.wmnet, cp5022.eqsin.wmnet are marked down but pooled: uploadlb_80: Servers cp5028.eqsin.wmnet, cp5030.eqsin.wmnet, cp5032.eqsin.wmnet are marked down but pooled: testlb_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb_443: Servers cp5024.eqsi
[11:58:17] <icinga-wm>	  cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_80: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet are marked down but pooled: testlb6_80: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet are marked down but pooled: uploadlb6_80: Servers cp5028.eqsin.wmnet, cp5030.eqsin.wmnet, cp5032.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5026.eqsin.wmnet, cp5030.eqsin.wmnet, cp5032.eqs
[11:58:17] <icinga-wm>	  are marked down but pooled: uploadlb6_443: Servers cp5028.eqsin.wmnet, cp5026.eqsin.wmnet, cp5030.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5022 https://wikitech.wikimedia.org/wiki/PyBal
[11:59:06] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a respons
[11:59:06] <icinga-wm>	 ceived: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[12:00:26] <jynus>	 why delayed alerts, godog ? Do they have a higher timeout?
[12:00:37] <wikibugs>	 (03PS1) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884
[12:01:08] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[12:01:12] <jynus>	 or maybe it is a new issue in sin: OSPF status on mr1-eqsin is CRITICAL?
[12:01:51] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: multi-processing changes for articlequality and drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[12:01:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (owner: 10Jbond)
[12:02:06] <jynus>	 maybe it is flapping?
[12:04:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878177 (https://phabricator.wikimedia.org/T313733) (owner: 10Effie Mouzeli)
[12:04:20] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test p
[12:04:20] <icinga-wm>	 ed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[12:05:22] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[12:05:54] <jynus>	 yeah, it is flapping on and off, I will see if I can downtime it
[12:06:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb_443: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but 
[12:06:58] <icinga-wm>	 testlb6_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5022.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.
[12:06:58] <icinga-wm>	 p5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:08:34] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[12:09:17] <jynus>	 I will downtime LVS health checks on 4, 5 and 6 until tomorrow, CC vgutierrez in case they return earlier and have to be deleted
[12:09:41] <jynus>	 lvs500X, it is understood
[12:10:08] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[12:10:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1003.eqiad.wmnet with OS bullseye
[12:10:17] <jynus>	 (they are up, but some backends aren't)
[12:10:58] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1004.eqiad.wmnet with OS bullseye
[12:11:24] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/878885
[12:11:42] <jynus>	 hopefully that solves the flapping alerts
[12:13:34] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[12:14:18] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[12:15:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/878885 (owner: 10Muehlenhoff)
[12:17:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add SPDX headers for various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/860912 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[12:17:51] <wikibugs>	 (03PS2) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884
[12:18:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast4004.wikimedia.org
[12:18:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:19:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (owner: 10Jbond)
[12:21:20] <wikibugs>	 (03PS1) 10Slyngshede: C:idm::deployment Add missing package [puppet] - 10https://gerrit.wikimedia.org/r/878928 (https://phabricator.wikimedia.org/T320795)
[12:21:37] <wikibugs>	 (03PS3) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884
[12:22:17] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment Add missing package [puppet] - 10https://gerrit.wikimedia.org/r/878928 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede)
[12:22:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (owner: 10Jbond)
[12:24:20] <wikibugs>	 (03PS4) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884
[12:24:22] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:24:53] <wikibugs>	 (03PS5) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315)
[12:24:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[12:24:55] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99)
[12:25:26] <wikibugs>	 (03PS2) 10Effie Mouzeli: memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881
[12:25:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli)
[12:27:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:environment: Add ablilty to inject environment variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond)
[12:27:43] <wikibugs>	 (03PS6) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315)
[12:27:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Updates of passwords of users created with postgresql::user / PostgreSQL change to scram-sha256 - https://phabricator.wikimedia.org/T326325 (10LSobanski)
[12:28:56] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:25] <wikibugs>	 (03PS3) 10Effie Mouzeli: memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881
[12:29:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli)
[12:30:11] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 56630
[12:30:37] <wikibugs>	 (03PS7) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315)
[12:31:14] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 56630
[12:31:37] <wikibugs>	 (03PS4) 10Effie Mouzeli: memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881
[12:33:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4004.wikimedia.org - jmm@cumin2002"
[12:34:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4004.wikimedia.org - jmm@cumin2002"
[12:34:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:34:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast4004.wikimedia.org on all recursors
[12:34:50] <wikibugs>	 (03PS1) 10Btullis: Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872)
[12:35:00] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) bast4004.wikimedia.org on all recursors
[12:36:36] <wikibugs>	 (03PS2) 10Btullis: Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872)
[12:36:41] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8849
[12:37:06] <wikibugs>	 (03PS8) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315)
[12:37:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli)
[12:38:21] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39074/console" [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond)
[12:39:08] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:40:08] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8849
[12:40:09] <wikibugs>	 (03PS9) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315)
[12:40:18] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond)
[12:41:22] <wikibugs>	 (03CR) 10Btullis: "Adding Amir and Manuel for sanity checking please." [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872) (owner: 10Btullis)
[12:42:40] <moritzm>	 !log installing postgresql 11 security updates on maps/codfw
[12:42:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:43:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 40): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39075/console" [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli)
[12:43:27] <wikibugs>	 (03PS3) 10JMeybohm: cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292)
[12:45:36] <wikibugs>	 (03PS10) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315)
[12:46:18] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] site: Remove retired mc* hosts [puppet] - 10https://gerrit.wikimedia.org/r/878177 (https://phabricator.wikimedia.org/T313733) (owner: 10Effie Mouzeli)
[12:46:24] <wikibugs>	 (03PS2) 10Effie Mouzeli: site: Remove retired mc* hosts [puppet] - 10https://gerrit.wikimedia.org/r/878177 (https://phabricator.wikimedia.org/T313733)
[12:49:15] <wikibugs>	 (03CR) 10Ayounsi: environment: add no_proxy config directly to environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond)
[12:51:03] <godog>	 jynus: not sure tbh, maybe downtime expired
[12:51:14] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:53:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast4004.wikimedia.org
[12:56:02] <wikibugs>	 (03PS1) 10JMeybohm: coredns: Remove deprecated nodeSelector [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935
[12:57:21] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli)
[12:58:27] <wikibugs>	 (03CR) 10Muehlenhoff: bgpalerter: add profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878868 (owner: 10Jbond)
[12:59:28] <wikibugs>	 (03PS4) 10Jbond: puppet: allow to specify the exact message when disable/enable puppet [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773
[12:59:52] <wikibugs>	 (03CR) 10Jbond: "updated" [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond)
[12:59:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm)
[13:01:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/878879 (owner: 10Muehlenhoff)
[13:01:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast6002.wikimedia.org
[13:01:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:03:51] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond)
[13:03:52] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc1038.eqiad.wmnet with OS bullseye
[13:04:33] <wikibugs>	 (03PS2) 10JMeybohm: coredns: Remove deprecated nodeSelector [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935
[13:06:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet: allow to specify the exact message when disable/enable puppet [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond)
[13:06:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo Legal's take on what words we should be looking for" [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[13:07:29] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1038.eqiad.wmnet with OS bullseye
[13:07:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:rpkivalidator: add bgpalerter to rpki servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond)
[13:09:18] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:25] <wikibugs>	 (03Merged) 10jenkins-bot: puppet: allow to specify the exact message when disable/enable puppet [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond)
[13:11:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast6002.wikimedia.org - jmm@cumin2002"
[13:11:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast6002.wikimedia.org - jmm@cumin2002"
[13:11:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:11:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast6002.wikimedia.org on all recursors
[13:12:13] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) bast6002.wikimedia.org on all recursors
[13:12:16] <wikibugs>	 (03PS1) 10Jbond: P:bgpalerter: make sure we create the sysuser before calling the class [puppet] - 10https://gerrit.wikimedia.org/r/878937
[13:12:57] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Skip empty fixture files [deployment-charts] - 10https://gerrit.wikimedia.org/r/875947 (owner: 10JMeybohm)
[13:13:32] <icinga-wm>	 PROBLEM - Check systemd state on rpki1001 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:58] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:02] <wikibugs>	 (03PS2) 10Jbond: P:bgpalerter: make sure we create the sysuser before calling the class [puppet] - 10https://gerrit.wikimedia.org/r/878937
[13:15:11] <jbond>	 node-bgpalerter.service is expected, I'll fix
[13:15:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878937 (owner: 10Jbond)
[13:15:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:bgpalerter: make sure we create the sysuser before calling the class [puppet] - 10https://gerrit.wikimedia.org/r/878937 (owner: 10Jbond)
[13:18:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1038.eqiad.wmnet with reason: host reimage
[13:20:10] <icinga-wm>	 PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:21:00] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1038.eqiad.wmnet with reason: host reimage
[13:27:53] <wikibugs>	 (03CR) 10Gmodena: Add flink-app-example service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[13:29:18] <wikibugs>	 (03PS1) 10Effie Mouzeli: memcached: minor fix for bullseye installation #2 [puppet] - 10https://gerrit.wikimedia.org/r/878939
[13:31:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast6002.wikimedia.org
[13:34:13] <wikibugs>	 (03CR) 10Ottomata: Add flink-app-example service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[13:35:32] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage
[13:38:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878939 (owner: 10Effie Mouzeli)
[13:38:39] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage
[13:42:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35753
[13:44:34] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35753
[13:44:54] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9584
[13:45:35] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9584
[13:45:47] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3302
[13:45:55] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade
[13:46:48] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3302
[13:47:03] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37002
[13:47:25] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37002
[13:47:44] <wikibugs>	 (03PS4) 10JMeybohm: cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292)
[13:47:46] <wikibugs>	 (03PS3) 10JMeybohm: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729)
[13:47:48] <wikibugs>	 (03PS1) 10JMeybohm: Pin coredns, eventrouter and helm-state-metrics for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878940 (https://phabricator.wikimedia.org/T307943)
[13:47:50] <wikibugs>	 (03PS1) 10JMeybohm: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729)
[13:50:37] <wikibugs>	 (03Abandoned) 10Ayounsi: Adding more new LEAF switches from Eqiad rows E/F to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney)
[13:50:58] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC NOOP https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39077" [puppet] - 10https://gerrit.wikimedia.org/r/878939 (owner: 10Effie Mouzeli)
[13:51:59] <wikibugs>	 (03PS4) 10JMeybohm: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729)
[13:52:01] <wikibugs>	 (03PS2) 10JMeybohm: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729)
[13:52:03] <wikibugs>	 (03PS1) 10JMeybohm: calico: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729)
[13:52:56] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] memcached: minor fix for bullseye installation #2 [puppet] - 10https://gerrit.wikimedia.org/r/878939 (owner: 10Effie Mouzeli)
[13:53:08] <wikibugs>	 (03PS2) 10Effie Mouzeli: memcached: minor fix for bullseye installation #2 [puppet] - 10https://gerrit.wikimedia.org/r/878939
[13:54:09] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:54:25] <wikibugs>	 (03CR) 10Ayounsi: "That's more than 2 years old, is it still needed?" [puppet] - 10https://gerrit.wikimedia.org/r/618767 (owner: 10Volans)
[13:55:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Add bast3006/bast4004/bast6002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/878945 (https://phabricator.wikimedia.org/T324974)
[13:55:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[13:55:20] <wikibugs>	 (03Abandoned) 10Jbond: hieradata: add ASN name comments [puppet] - 10https://gerrit.wikimedia.org/r/753147 (owner: 10Jbond)
[13:55:43] <wikibugs>	 (03PS1) 10JMeybohm: staging-codfw: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729)
[13:58:11] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1400).
[14:00:05] <jouncebot>	 Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:32] <Lucas_WMDE>	 o/
[14:01:19] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1001.eqiad.wmnet
[14:02:18] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[14:02:22] * MichaelG_WMDE is here to
[14:02:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1004.eqiad.wmnet with OS bullseye
[14:02:27] <MichaelG_WMDE>	 *too
[14:03:37] <Lucas_WMDE>	 bleh
[14:03:38] <Lucas_WMDE>	 `scap backport 877983 877972`
[14:03:52] <Lucas_WMDE>	 “backport failed: <Exception> Request Failed: https://gerrit.wikimedia.org/r/changes/Icfe7f38fdf9c3255d51713d3084593f880425d06/revisions/current/crd 404 Multiple changes found for Icfe7f38fdf9c3255d51713d3084593f880425d06”
[14:04:04] <Lucas_WMDE>	 if only I had specified change numbers, which are unique, instead of ambiguous change IDs………
[14:04:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/WikibaseLexeme] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877983 (https://phabricator.wikimedia.org/T326621) (owner: 10Lucas Werkmeister (WMDE))
[14:04:36] <taavi>	 I think that's an (already reported) scap bug with dependencies across multiple branches
[14:04:44] <Lucas_WMDE>	 looks like doing them one at a time will work, it’ll just take even longer in CI 🤷
[14:05:13] <taavi>	 or you can +2 them manually
[14:05:23] <Lucas_WMDE>	 yeah I’ll do that in a few minutes
[14:05:42] <Lucas_WMDE>	 looks like https://phabricator.wikimedia.org/T323277 is the phab task
[14:06:33] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bullseye
[14:08:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add bast3006/bast4004/bast6002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/878945 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff)
[14:09:21] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "already +2ing to speed up backport later" [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE))
[14:10:22] <moritzm>	 !log installing postgresql 11 security updates on maps/eqiad
[14:10:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:58] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1001.eqiad.wmnet
[14:12:25] <icinga-wm>	 PROBLEM - puppet last run on puppetdb2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604986 seconds, message: maint, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:12:58] <moritzm>	 puppetdb2002 was me, fixing
[14:14:39] <logmsgbot>	 !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99)
[14:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[14:16:03] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:17:52] * MichaelG_WMDE is afk, but back quickly
[14:18:01] <icinga-wm>	 RECOVERY - puppet last run on puppetdb2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:19:12] <wikibugs>	 (03Merged) 10jenkins-bot: Fix test constructing HTMLFormField without parent [extensions/WikibaseLexeme] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877983 (https://phabricator.wikimedia.org/T326621) (owner: 10Lucas Werkmeister (WMDE))
[14:19:41] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:877983|Fix test constructing HTMLFormField without parent (T326621)]]
[14:19:45] <stashbot>	 T326621: Wikibase\Lexeme\Tests\MediaWiki\Specials\HTMLForm\LemmaLanguageFieldTest::testValidateWithValidLanguageCodeReturnsTrue HTMLFormField::__construct: Constructing an HTMLFormField without a 'parent' parameter - https://phabricator.wikimedia.org/T326621
[14:21:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and lucaswerkmeister-wmde: Backport for [[gerrit:877983|Fix test constructing HTMLFormField without parent (T326621)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[14:22:03] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage
[14:22:11] * MichaelG_WMDE is back and looking at zuul
[14:23:51] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[14:24:03] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:21] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage
[14:25:22] <wikibugs>	 (03Merged) 10jenkins-bot: Add missing parentheses to vector search match text [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE))
[14:25:29] <Lucas_WMDE>	 meh, it merged too soon
[14:25:33] <Lucas_WMDE>	 I’ll have to sync that one manually then
[14:27:13] <taavi>	 scap backport works just fine with already merged commits
[14:27:18] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:877983|Fix test constructing HTMLFormField without parent (T326621)]] (duration: 08m 38s)
[14:28:24] <stashbot>	 T326621: Wikibase\Lexeme\Tests\MediaWiki\Specials\HTMLForm\LemmaLanguageFieldTest::testValidateWithValidLanguageCodeReturnsTrue HTMLFormField::__construct: Constructing an HTMLFormField without a 'parent' parameter - https://phabricator.wikimedia.org/T326621
[14:28:55] <Lucas_WMDE>	 taavi: I get the “multiple changes found” error again so I guess it’s still confused by the Depends-On
[14:29:03] <taavi>	 ah, hmm
[14:29:06] <Lucas_WMDE>	 (from just `scap backport 877972`)
[14:29:53] <Lucas_WMDE>	 pulled the Wikibase change to mwdebug1001
[14:30:03] <Lucas_WMDE>	 should be testable on test wikidata
[14:30:04] <wikibugs>	 (03PS1) 10Jelto: sre.gitlab.upgrade: use url instead of hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/878953 (https://phabricator.wikimedia.org/T323569)
[14:30:06] <Lucas_WMDE>	 (cc MichaelG_WMDE)
[14:30:18] * MichaelG_WMDE looks
[14:30:37] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 (owner: 10Giuseppe Lavagetto)
[14:30:43] <Lucas_WMDE>	 looks good to me, I think
[14:30:49] <MichaelG_WMDE>	 works for me!
[14:31:43] <MichaelG_WMDE>	 I'm fine with this moving forward :)
[14:32:18] <Lucas_WMDE>	 syncing
[14:32:27] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "makes sense" [cookbooks] - 10https://gerrit.wikimedia.org/r/878953 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[14:32:45] <Lucas_WMDE>	 (with T326633 in the log message because I think it’s better than no task at all)
[14:32:46] <stashbot>	 T326633: Monitor the deployment of the new Search on the 2022 version of the Vector skin - https://phabricator.wikimedia.org/T326633
[14:35:57] <MichaelG_WMDE>	 yep, I think that is what that task is for, all the misc stuff
[14:39:10] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:39:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.18/extensions/Wikibase/repo/resources/wikibase.vector.searchClient.js: Backport: [[gerrit:877972|Add missing parentheses to vector search match text (T326633)]] (1/2) (duration: 07m 09s)
[14:39:25] <stashbot>	 T326633: Monitor the deployment of the new Search on the 2022 version of the Vector skin - https://phabricator.wikimedia.org/T326633
[14:39:58] <Lucas_WMDE>	 and syncing the second file now
[14:40:05] <Lucas_WMDE>	 (just for consistency)
[14:40:42] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: use url instead of hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/878953 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[14:42:02] <MichaelG_WMDE>	 I can confirm that it now also works on test.wikidata without WikimediaDebug 
[14:42:12] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[14:42:48] <wikibugs>	 (03Merged) 10jenkins-bot: sre.gitlab.upgrade: use url instead of hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/878953 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[14:42:54] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:43:55] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (10StephaneRebai) Thank you @Jelto i will verify access and close this when done
[14:44:38] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: set maxSurge [deployment-charts] - 10https://gerrit.wikimedia.org/r/878957 (https://phabricator.wikimedia.org/T233196)
[14:44:49] <Lucas_WMDE>	 yay
[14:46:09] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] cassandra_dev: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878186 (owner: 10Eevans)
[14:46:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.18/extensions/Wikibase/repo/tests/jest/wikibase.vector.searchClient.spec.js: Backport: [[gerrit:877972|Add missing parentheses to vector search match text (T326633)]] (2/2) (duration: 06m 46s)
[14:46:40] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[14:46:41] <stashbot>	 T326633: Monitor the deployment of the new Search on the 2022 version of the Vector skin - https://phabricator.wikimedia.org/T326633
[14:46:46] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1005.eqiad.wmnet with OS bullseye
[14:46:56] <Lucas_WMDE>	 I don’t see anything else in the deployment calendar
[14:47:01] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:50] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:49:01] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] cassandra_dev: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878186 (owner: 10Eevans)
[14:54:10] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:39] <wikibugs>	 (03PS2) 10Eevans: Upgraded deployment-prep echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860941
[14:56:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[14:58:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10jbond) > With bullseye apt even does this automatically  wonder if we could backport this to buster, ignore stretch and call it done?
[14:58:54] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:37] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] Start using the ClusterConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 (owner: 10Giuseppe Lavagetto)
[15:04:29] <wikibugs>	 (03PS1) 10Andrew Bogott: Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960
[15:04:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960 (owner: 10Andrew Bogott)
[15:06:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10MoritzMuehlenhoff) >>! In T277011#8516622, @jbond wrote: >> With bullseye apt even does this automatically  > wonder if we could backport this to buster, ignore...
[15:09:23] <wikibugs>	 (03PS11) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[15:10:07] <wikibugs>	 (03PS2) 10Muehlenhoff: package_builder::pbuilder_hook: Manage the hook directory with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/878879
[15:10:38] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:00] <wikibugs>	 (03PS12) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[15:12:03] <wikibugs>	 (03PS1) 10Effie Mouzeli: P:memcached::memkeys: do not install memkeys if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970)
[15:12:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:memcached::memkeys: do not install memkeys if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli)
[15:13:35] <wikibugs>	 (03PS2) 10Effie Mouzeli: P:memcached::memkeys: do not install memkeys if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970)
[15:14:21] <wikibugs>	 (03PS2) 10Andrew Bogott: Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960
[15:14:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960 (owner: 10Andrew Bogott)
[15:14:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10jbond) >>! In T277011#8516648, @MoritzMuehlenhoff wrote: >>>! In T277011#8516622, @jbond wrote: >>> With bullseye apt even does this automatically  >> wonder if...
[15:15:46] <wikibugs>	 (03CR) 10Muehlenhoff: P:memcached::memkeys: do not install memkeys if on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli)
[15:17:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P42982 and previous config saved to /var/cache/conftool/dbconfig/20230111-151712-marostegui.json
[15:17:31] <wikibugs>	 (03PS3) 10Effie Mouzeli: P:memcached::memkeys: do not install memkeys if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970)
[15:18:02] <wikibugs>	 (03PS1) 10Marostegui: db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878964
[15:18:47] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878964 (owner: 10Marostegui)
[15:21:24] <marostegui>	 !log Stop mariadb on db1106 to reclone db1206 (there will be lag on s1 on wikireplicas) T326669
[15:21:25] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC OK  https://puppet-compiler.wmflabs.org/output/878962/39081/" [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli)
[15:21:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:28] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[15:23:42] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872) (owner: 10Btullis)
[15:27:02] <wikibugs>	 (03PS1) 10Jbond: bgpalerter: add default monitors and reports [puppet] - 10https://gerrit.wikimedia.org/r/879046
[15:30:29] <wikibugs>	 (03PS2) 10Muehlenhoff: postgresql::user: No longer compare password hashes on Bookworm and later [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325)
[15:31:37] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:32:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39083/console" [puppet] - 10https://gerrit.wikimedia.org/r/879046 (owner: 10Jbond)
[15:32:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] bgpalerter: add default monitors and reports [puppet] - 10https://gerrit.wikimedia.org/r/879046 (owner: 10Jbond)
[15:32:37] <zabe>	 jouncebot, nowandnext
[15:32:37] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 27 minute(s)
[15:32:37] <jouncebot>	 In 2 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1800)
[15:33:30] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Start reading from cul_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878870 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[15:34:25] <wikibugs>	 (03Merged) 10jenkins-bot: Start reading from cul_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878870 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[15:34:52] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:878870|Start reading from cul_actor everywhere (T233004)]]
[15:34:56] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[15:36:34] <logmsgbot>	 !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:878870|Start reading from cul_actor everywhere (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[15:37:49] <jinxer-wm>	 (ProbeDown) firing: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:37:49] <jinxer-wm>	 (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:37:57] <wikibugs>	 (03PS1) 10Jbond: bgpalerter: use wss/https for websocket connection [puppet] - 10https://gerrit.wikimedia.org/r/879048
[15:38:17] <RhinosF1>	 zabe: something just went down
[15:38:30] <godog>	 looking, got paged
[15:38:34] <jinxer-wm>	 (virtual-chassis crash) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[15:38:40] <akosiaris>	 ncredir? this is bigger?
[15:38:44] <jynus>	 weird, we just got pages from eqsin
[15:38:46] <jinxer-wm>	 (ThanosSidecarBucketOperationsFailed) firing: Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed
[15:38:52] <logmsgbot>	 !log zabe@deploy1002 sync-world aborted: Backport for [[gerrit:878870|Start reading from cul_actor everywhere (T233004)]] (duration: 04m 00s)
[15:38:53] <logmsgbot>	 !log zabe@deploy1002 backport aborted:  (duration: 04m 25s)
[15:38:56] <XioNoX>	 godog: dunno if related, but the faulty switch in eqsin just came back up
[15:39:03] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:39:04] <RhinosF1>	 godog: fyi, deployment in progress
[15:39:07] <jynus>	 XioNoX: that would explain it
[15:39:16] <akosiaris>	 ah, eqsin, ok
[15:39:18] <jynus>	 but let's double check
[15:39:23] <godog>	 could be yeah, I'll silence and ack the alerts
[15:39:24] <jynus>	 no user affected
[15:39:32] <XioNoX>	 godog: but it shouldn't alert :)
[15:39:36] <XioNoX>	 as it's coming back up
[15:39:50] <jynus>	 XioNoX: I think it could happen because there were some unknowns
[15:40:12] <jynus>	 that became knowns, and then could wake up in a bad order, but checking that it is not something else
[15:41:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] bgpalerter: use wss/https for websocket connection [puppet] - 10https://gerrit.wikimedia.org/r/879048 (owner: 10Jbond)
[15:41:19] <wikibugs>	 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) > Next step is onsite "format install" https://supportportal.juniper.net/s/article/EX-QFX-Procedure-to-format-install-QFX5K-device-using-a-USB?lang...
[15:41:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:41:27] <godog>	 mmhh yeah not sure yet why ProbeDown notified tbh
[15:41:38] <godog>	 but it recovered alright
[15:41:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:42:03] <jynus>	 I will remove the downtimes I added to make sure all services come back online
[15:42:24] <jynus>	 as in, "health checks happen correctly"
[15:42:34] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:42:46] <jinxer-wm>	 (JobUnavailable) firing: (24) Reduced availability for job bird in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:42:49] <jinxer-wm>	 (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:42:49] <jinxer-wm>	 (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:43:30] <wikibugs>	 (03PS4) 10Effie Mouzeli: P:memcached::memkeys: install memkeys only if on buster [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970)
[15:43:43] <wikibugs>	 (03PS1) 10Marostegui: db1206: Future sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/879049
[15:43:46] <jinxer-wm>	 (ThanosSidecarBucketOperationsFailed) resolved: Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed
[15:44:16] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: 10Muehlenhoff)
[15:44:45] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:45:00] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] Upgraded deployment-prep echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860941 (owner: 10Eevans)
[15:45:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1206: Future sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/879049 (owner: 10Marostegui)
[15:45:16] <logmsgbot>	 !log zabe@deploy1002 Started scap: T233004
[15:45:19] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[15:45:46] <wikibugs>	 (03Merged) 10jenkins-bot: Upgraded deployment-prep echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860941 (owner: 10Eevans)
[15:46:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli)
[15:47:34] <jinxer-wm>	 (Emergency syslog message) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:48:15] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:48:34] <jinxer-wm>	 (virtual-chassis crash) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[15:48:48] <wikibugs>	 (03CR) 10Effie Mouzeli: P:memcached::memkeys: install memkeys only if on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli)
[15:50:00] <wikibugs>	 (03PS3) 10Andrew Bogott: Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960
[15:50:27] <wikibugs>	 (03PS1) 10Ottomata: flink - Add examples/wikimedia with simple table datagen -> print pipeline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/879050 (https://phabricator.wikimedia.org/T316519)
[15:50:31] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] P:memcached::memkeys: install memkeys only if on buster [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli)
[15:51:35] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink - Add examples/wikimedia with simple table datagen -> print pipeline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/879050 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[15:53:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960 (owner: 10Andrew Bogott)
[15:53:10] <logmsgbot>	 !log zabe@deploy1002 Finished scap: T233004 (duration: 07m 54s)
[15:53:14] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[15:54:31] <wikibugs>	 (03PS3) 10Muehlenhoff: postgresql::user: No longer compare password hashes on Bookworm and later [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325)
[15:54:37] <wikibugs>	 (03PS1) 10Ayounsi: Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385)
[15:55:33] <icinga-wm>	 RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:56:07] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: 10Muehlenhoff)
[15:56:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi)
[15:56:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 (owner: 10Dzahn)
[15:57:58] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM, we should probably add that (or a similar mechanism) by defaults in the service scaffolding." [deployment-charts] - 10https://gerrit.wikimedia.org/r/878957 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[15:58:35] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:58:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:00:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:00:21] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] admin: Add Jennifer Hancock to the datacenter-ops group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[16:00:39] <wikibugs>	 (03PS2) 10Ottomata: Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576)
[16:00:55] <wikibugs>	 (03PS1) 10Andrew Bogott: Replace 'yoga' with 'zed' in a zed manifest [puppet] - 10https://gerrit.wikimedia.org/r/879052
[16:01:50] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host mc1038.eqiad.wmnet with OS bullseye
[16:01:56] <wikibugs>	 (CR) Muehlenhoff: postgresql::user: No longer compare password hashes on Bookworm and later (1 comment) [puppet] - https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: Muehlenhoff)
[16:02:04] <wikibugs>	 (PS3) Ottomata: Add flink-app-example service [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576)
[16:02:23] <wikibugs>	 (PS1) Jdrewniak: Fix mustache template rendering when TOC is rerendered after an edit [skins/Vector] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879093 (https://phabricator.wikimedia.org/T326682)
[16:02:44] <wikibugs>	 (PS1) Jdrewniak: Fix mustache template rendering when TOC is rerendered after an edit [skins/Vector] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879094 (https://phabricator.wikimedia.org/T326682)
[16:03:14] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[16:04:20] <wikibugs>	 (PS2) Ayounsi: Netbox: add support for central Redis [puppet] - https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385)
[16:04:57] <wikibugs>	 (CR) CI reject: [V: -1] Add flink-app-example service [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[16:05:06] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Force update after eqsin outage is over - volans@cumin1001"
[16:05:35] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Replace 'yoga' with 'zed' in a zed manifest [puppet] - https://gerrit.wikimedia.org/r/879052 (owner: Andrew Bogott)
[16:05:53] <wikibugs>	 (PS1) Marostegui: add_cul_reason_id_T321391.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/879054
[16:06:10] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Force update after eqsin outage is over - volans@cumin1001"
[16:06:10] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:06:51] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:07:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:10:30] <wikibugs>	 (CR) Michael Große: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: Michael Große)
[16:10:49] <wikibugs>	 (CR) Ottomata: Add flink-app-example service (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[16:12:56] <wikibugs>	 (PS5) JMeybohm: cert-manager: Allow leader election to write configmaps [deployment-charts] - https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292)
[16:12:58] <wikibugs>	 (PS2) JMeybohm: Pin coredns, eventrouter and helm-state-metrics for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/878940 (https://phabricator.wikimedia.org/T307943)
[16:13:00] <wikibugs>	 (PS5) JMeybohm: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729)
[16:13:02] <wikibugs>	 (PS3) JMeybohm: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729)
[16:13:04] <wikibugs>	 (PS2) JMeybohm: calico: Remove kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729)
[16:13:06] <wikibugs>	 (PS2) JMeybohm: staging-codfw: Remove kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729)
[16:13:24] <wikibugs>	 (CR) JMeybohm: [C: +2] Skip empty fixture files [deployment-charts] - https://gerrit.wikimedia.org/r/875947 (owner: JMeybohm)
[16:15:45] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:07] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Enable the API on test-wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: Michael Große)
[16:16:31] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:01] <wikibugs>	 (PS2) Michael Große: Enable the REST API on test-wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999)
[16:17:35] <wikibugs>	 (CR) Michael Große: Enable the REST API on test-wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: Michael Große)
[16:18:38] <wikibugs>	 (Merged) jenkins-bot: Skip empty fixture files [deployment-charts] - https://gerrit.wikimedia.org/r/875947 (owner: JMeybohm)
[16:18:39] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:19:21] <wikibugs>	 (PS3) Ayounsi: Netbox: add support for central Redis [puppet] - https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385)
[16:20:01] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:21:30] <wikibugs>	 (CR) JMeybohm: Add flink-app-example service (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[16:22:00] <marostegui>	 !log dbmaint deploy schema change with replication on s6 eqiad T321391 
[16:22:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:03] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[16:22:35] <wikibugs>	 (CR) Ayounsi: "Some outstanding questions:" [puppet] - https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: Ayounsi)
[16:23:13] <wikibugs>	 (PS1) Zabe: Start reading from cuc_actor on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879055 (https://phabricator.wikimedia.org/T233004)
[16:24:03] <wikibugs>	 (CR) Ayounsi: Netbox: add support for central Redis (1 comment) [puppet] - https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: Ayounsi)
[16:25:36] <wikibugs>	 (CR) Ayounsi: Netbox: add support for central Redis (1 comment) [puppet] - https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: Ayounsi)
[16:25:52] <marostegui>	 !log dbmaint deploy schema change with replication on s8 eqiad T321391 
[16:25:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:12] <wikibugs>	 (PS1) Zabe: Stop writing to cul_user and cul_user_text on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004)
[16:28:14] <wikibugs>	 (PS4) Ottomata: Add flink-app-example service [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576)
[16:28:33] <wikibugs>	 (CR) Ottomata: Add flink-app-example service (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[16:30:47] <wikibugs>	 (CR) CI reject: [V: -1] Add flink-app-example service [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[16:31:23] <marostegui>	 !log dbmaint deploy schema change with replication on s4 eqiad T321391 
[16:31:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:27] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[16:33:11] <wikibugs>	 (PS5) Ottomata: Add flink-app-example service [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576)
[16:35:15] <wikibugs>	 (CR) Ladsgroup: [C: +1] add_cul_reason_id_T321391.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/879054 (owner: Marostegui)
[16:35:22] <wikibugs>	 (CR) Marostegui: [C: +2] add_cul_reason_id_T321391.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/879054 (owner: Marostegui)
[16:35:49] <wikibugs>	 (Merged) jenkins-bot: add_cul_reason_id_T321391.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/879054 (owner: Marostegui)
[16:35:59] <wikibugs>	 (CR) CI reject: [V: -1] Add flink-app-example service [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[16:37:34] <jinxer-wm>	 (Processor usage over 85%) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[16:38:49] <marostegui>	 !log dbmaint deploy schema change with replication on s5 eqiad T321391 
[16:38:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:52] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[16:40:05] <wikibugs>	 (PS6) Ottomata: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576)
[16:41:37] <wikibugs>	 (CR) JMeybohm: [C: +2] cert-manager: Allow leader election to write configmaps [deployment-charts] - https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292) (owner: JMeybohm)
[16:41:55] <wikibugs>	 (PS1) Jelto: sre.gitlab.upgrade: use spicerack reason instead of string [cookbooks] - https://gerrit.wikimedia.org/r/879059 (https://phabricator.wikimedia.org/T323569)
[16:41:57] <wikibugs>	 (PS1) Jelto: sre.gitlab.upgrade: check for high threshold in fail_for_disk_space [cookbooks] - https://gerrit.wikimedia.org/r/879060 (https://phabricator.wikimedia.org/T323569)
[16:42:14] <wikibugs>	 (CR) JMeybohm: [C: +2] Pin coredns, eventrouter and helm-state-metrics for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/878940 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[16:42:28] <wikibugs>	 (CR) JMeybohm: [C: +2] Add istio config for main/wikikube clusters on k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/878190 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm)
[16:42:36] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/879059 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[16:43:15] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/879060 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[16:44:24] <wikibugs>	 (PS7) Ottomata: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576)
[16:45:52] <wikibugs>	 (PS8) Ottomata: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576)
[16:46:01] <wikibugs>	 (CR) Jelto: [C: +2] sre.gitlab.upgrade: check for high threshold in fail_for_disk_space [cookbooks] - https://gerrit.wikimedia.org/r/879060 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[16:46:03] <wikibugs>	 (CR) Jelto: [C: +2] sre.gitlab.upgrade: use spicerack reason instead of string [cookbooks] - https://gerrit.wikimedia.org/r/879059 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[16:46:30] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: Ayounsi)
[16:47:09] <wikibugs>	 (Merged) jenkins-bot: Add istio config for main/wikikube clusters on k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/878190 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm)
[16:47:17] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:50] <wikibugs>	 (Merged) jenkins-bot: sre.gitlab.upgrade: use spicerack reason instead of string [cookbooks] - https://gerrit.wikimedia.org/r/879059 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[16:47:57] <wikibugs>	 (Merged) jenkins-bot: sre.gitlab.upgrade: check for high threshold in fail_for_disk_space [cookbooks] - https://gerrit.wikimedia.org/r/879060 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[16:51:43] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:52:34] <jinxer-wm>	 (Processor usage over 85%) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Processor usage over 85%   - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[16:53:17] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:53:54] <wikibugs>	 SRE, ops-eqsin, Infrastructure-Foundations, netops, Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (ayounsi) Stalled→Resolved That's all done.
[16:54:01] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi)
[16:54:19] <wikibugs>	 (PS1) Marostegui: Revert "db1106: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/879095
[16:54:23] <wikibugs>	 (CR) Marostegui: [C: -2] "Not ready yet" [puppet] - https://gerrit.wikimedia.org/r/879095 (owner: Marostegui)
[16:54:27] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi)
[16:55:09] <wikibugs>	 (CR) Btullis: [C: +1] "This looks good to me." [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[16:56:45] <wikibugs>	 (CR) BCornwall: [V: +1] varnish: Template out thread pool settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[16:57:07] <wikibugs>	 (Merged) jenkins-bot: cert-manager: Allow leader election to write configmaps [deployment-charts] - https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292) (owner: JMeybohm)
[16:57:10] <wikibugs>	 (Merged) jenkins-bot: Pin coredns, eventrouter and helm-state-metrics for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/878940 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[16:58:47] <wikibugs>	 (PS6) JMeybohm: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729)
[16:58:49] <wikibugs>	 (PS4) JMeybohm: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729)
[16:58:51] <wikibugs>	 (PS3) JMeybohm: calico: Remove kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729)
[16:58:53] <wikibugs>	 (PS3) JMeybohm: staging-codfw: Remove kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729)
[16:59:58] <wikibugs>	 (CR) Btullis: [C: +2] Detect the correct disks for the O/S on the cephosd servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: Btullis)
[17:00:15] <wikibugs>	 (CR) Btullis: [C: +2] Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872) (owner: Btullis)
[17:00:43] <wikibugs>	 (PS3) Btullis: Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872)
[17:03:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[17:03:32] <wikibugs>	 (PS5) BCornwall: varnish: Template out thread pool settings [puppet] - https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723)
[17:03:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[17:04:10] <marostegui>	 !log dbmaint deploy schema change with replication on s7 eqiad T321391 
[17:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:13] <wikibugs>	 (PS9) Ottomata: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576)
[17:04:13] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[17:06:00] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39090/console" [puppet] - https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[17:08:22] <wikibugs>	 (CR) Marostegui: Revert "db1106: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/879095 (owner: Marostegui)
[17:08:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (Jelto)
[17:08:27] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1106: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/879095 (owner: Marostegui)
[17:09:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance
[17:10:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance
[17:10:12] <wikibugs>	 (CR) JMeybohm: [C: +2] staging-codfw: Remove kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[17:10:15] <wikibugs>	 (CR) JMeybohm: [C: +2] calico: Remove kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[17:10:17] <wikibugs>	 (CR) JMeybohm: [C: +2] Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[17:10:21] <wikibugs>	 (CR) JMeybohm: [C: +2] coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[17:10:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 1%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42987 and previous config saved to /var/cache/conftool/dbconfig/20230111-171021-root.json
[17:10:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance
[17:10:37] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance
[17:10:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance
[17:11:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance
[17:11:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2112 (T321391)', diff saved to https://phabricator.wikimedia.org/P42988 and previous config saved to /var/cache/conftool/dbconfig/20230111-171114-marostegui.json
[17:11:18] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[17:13:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T321391)', diff saved to https://phabricator.wikimedia.org/P42989 and previous config saved to /var/cache/conftool/dbconfig/20230111-171338-marostegui.json
[17:14:59] <wikibugs>	 (CR) JMeybohm: [C: +1] Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[17:15:40] <wikibugs>	 (Merged) jenkins-bot: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[17:15:42] <wikibugs>	 (Merged) jenkins-bot: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[17:15:44] <wikibugs>	 (Merged) jenkins-bot: calico: Remove kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[17:15:46] <wikibugs>	 (Merged) jenkins-bot: staging-codfw: Remove kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm)
[17:17:19] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: set maxSurge [deployment-charts] - https://gerrit.wikimedia.org/r/878957 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[17:18:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[17:18:34] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[17:20:37] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:21:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:21:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:21:29] <wikibugs>	 (PS5) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[17:21:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:23:32] <wikibugs>	 (Merged) jenkins-bot: thumbor: set maxSurge [deployment-charts] - https://gerrit.wikimedia.org/r/878957 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[17:23:53] <wikibugs>	 (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39091/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[17:25:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 5%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42991 and previous config saved to /var/cache/conftool/dbconfig/20230111-172526-root.json
[17:28:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:28:35] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:28:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P42992 and previous config saved to /var/cache/conftool/dbconfig/20230111-172844-marostegui.json
[17:29:16] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[17:30:54] <wikibugs>	 (PS1) JMeybohm: staging-codfw: Unpin eventrouter, helm-state-metrics, coredns [deployment-charts] - https://gerrit.wikimedia.org/r/879063 (https://phabricator.wikimedia.org/T326340)
[17:31:55] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: Muehlenhoff)
[17:36:55] <wikibugs>	 (CR) JMeybohm: [C: +2] staging-codfw: Unpin eventrouter, helm-state-metrics, coredns [deployment-charts] - https://gerrit.wikimedia.org/r/879063 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm)
[17:37:01] <wikibugs>	 (PS1) BCornwall: Remove all legacy_vip entries [puppet] - https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993)
[17:39:48] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[17:40:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[17:40:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 10%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42993 and previous config saved to /var/cache/conftool/dbconfig/20230111-174031-root.json
[17:42:15] <wikibugs>	 (Merged) jenkins-bot: staging-codfw: Unpin eventrouter, helm-state-metrics, coredns [deployment-charts] - https://gerrit.wikimedia.org/r/879063 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm)
[17:42:17] <wikibugs>	 (CR) Dzahn: [C: +2] statistics: assert new data type for globally reserved UIDs [puppet] - https://gerrit.wikimedia.org/r/877276 (owner: Dzahn)
[17:42:23] <wikibugs>	 (PS3) Dzahn: statistics: assert new data type for globally reserved UIDs [puppet] - https://gerrit.wikimedia.org/r/877276
[17:42:39] <wikibugs>	 (CR) Dzahn: statistics: assert new data type for globally reserved UIDs [puppet] - https://gerrit.wikimedia.org/r/877276 (owner: Dzahn)
[17:42:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:43:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P42994 and previous config saved to /var/cache/conftool/dbconfig/20230111-174351-marostegui.json
[17:47:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:53:24] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[17:53:24] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[17:54:09] <wikibugs>	 (PS6) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[17:55:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 25%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42995 and previous config saved to /var/cache/conftool/dbconfig/20230111-175536-root.json
[17:55:57] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-worker[1080,1084].eqiad.wmnet with reason: Shutting down to enable RAID battery replacement
[17:56:12] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1080,1084].eqiad.wmnet with reason: Shutting down to enable RAID battery replacement
[17:56:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=af7b1865-a9a0-44ba-aa68-9f34812e0d62) set by btullis@cumin1001 for 7 days, 0:...
[17:56:25] <wikibugs>	 (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39092/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[17:57:07] <wikibugs>	 SRE, Traffic: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (BCornwall)
[17:57:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (BTullis) Thanks @jcrespo - I've added another 7 days downtime.  @Jclark-ctr any idea when you might be able to fit in this battery replacement...
[17:57:27] <wikibugs>	 SRE, Traffic: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (BCornwall) p:Triage→Low
[17:57:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:58:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[17:58:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T321391)', diff saved to https://phabricator.wikimedia.org/P42996 and previous config saved to /var/cache/conftool/dbconfig/20230111-175857-marostegui.json
[17:59:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance
[17:59:01] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[17:59:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance
[17:59:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T321391)', diff saved to https://phabricator.wikimedia.org/P42997 and previous config saved to /var/cache/conftool/dbconfig/20230111-175919-marostegui.json
[17:59:41] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1800)
[18:01:14] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[18:01:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321391)', diff saved to https://phabricator.wikimedia.org/P42998 and previous config saved to /var/cache/conftool/dbconfig/20230111-180142-marostegui.json
[18:02:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:02:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[18:04:33] <wikibugs>	 SRE, Traffic: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (BCornwall)
[18:05:21] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39093/console" [puppet] - https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: BCornwall)
[18:06:04] <wikibugs>	 (PS1) BBlack: Revert "depool eqsin for network maintenance" [dns] - https://gerrit.wikimedia.org/r/879111
[18:07:51] <wikibugs>	 (PS1) JMeybohm: admin_ng: Don't pin image version of coredns [deployment-charts] - https://gerrit.wikimedia.org/r/879112 (https://phabricator.wikimedia.org/T326340)
[18:07:57] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[18:08:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[18:08:17] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39094/console" [puppet] - https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: BCornwall)
[18:08:59] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[18:09:26] <wikibugs>	 (CR) Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (2 comments) [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[18:09:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[18:09:46] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[18:10:20] <wikibugs>	 (CR) BBlack: [C: +1] "Thanks for chasing all this down, nice result!" [puppet] - https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: BCornwall)
[18:10:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 50%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42999 and previous config saved to /var/cache/conftool/dbconfig/20230111-181041-root.json
[18:10:42] <wikibugs>	 (CR) Ottomata: [C: +2] Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[18:12:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:13:42] <wikibugs>	 (CR) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (5 comments) [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[18:14:48] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[18:15:47] <wikibugs>	 (Merged) jenkins-bot: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[18:16:03] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39095/console" [puppet] - https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: BCornwall)
[18:16:17] <wikibugs>	 (PS1) Majavah: P:openstack::galera: add missing @resolve [puppet] - https://gerrit.wikimedia.org/r/879115
[18:16:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P43000 and previous config saved to /var/cache/conftool/dbconfig/20230111-181648-marostegui.json
[18:17:26] <wikibugs>	 (CR) BCornwall: [V: +1 C: +2] Remove all legacy_vip entries [puppet] - https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: BCornwall)
[18:20:12] <wikibugs>	 (CR) Ssingh: "PCC looks good! See inline comments once before we merge this" [puppet] - https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[18:20:39] <wikibugs>	 (PS1) Ottomata: flink-app-example - use correct patch to kubeconfig file in stream-enricnment-poc [deployment-charts] - https://gerrit.wikimedia.org/r/879116 (https://phabricator.wikimedia.org/T324576)
[18:21:25] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:openstack::galera: add missing @resolve [puppet] - https://gerrit.wikimedia.org/r/879115 (owner: Majavah)
[18:21:47] <wikibugs>	 (CR) JMeybohm: [C: +2] admin_ng: Don't pin image version of coredns [deployment-charts] - https://gerrit.wikimedia.org/r/879112 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm)
[18:22:19] <wikibugs>	 (CR) Dzahn: [C: +2] "noop on stat1004, an-launcher1002" [puppet] - https://gerrit.wikimedia.org/r/877276 (owner: Dzahn)
[18:22:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:22:49] <logmsgbot>	 !log btullis@cumin1001 Added views for new wiki: blkwiki T310872
[18:22:50] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0)
[18:22:53] <stashbot>	 T310872: Prepare and check storage layer for blkwiki - https://phabricator.wikimedia.org/T310872
[18:24:37] <wikibugs>	 SRE-OnFire, Gerrit, serviceops-collab, Release-Engineering-Team (Holiday Leftovers 🥡), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (thcipriani)
[18:25:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 75%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P43001 and previous config saved to /var/cache/conftool/dbconfig/20230111-182546-root.json
[18:25:53] <wikibugs>	 (Abandoned) BBlack: lvs recdns: remove legacy IP definition, step 1 [puppet] - https://gerrit.wikimedia.org/r/556178 (https://phabricator.wikimedia.org/T239993) (owner: BBlack)
[18:25:59] <wikibugs>	 (Abandoned) BBlack: lvs recdns: remove legacy IP definition, step 2 [puppet] - https://gerrit.wikimedia.org/r/556179 (https://phabricator.wikimedia.org/T239993) (owner: BBlack)
[18:26:40] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Don't pin image version of coredns [deployment-charts] - https://gerrit.wikimedia.org/r/879112 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm)
[18:27:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:27:59] <wikibugs>	 (CR) BBlack: [C: +2] Revert "depool eqsin for network maintenance" [dns] - https://gerrit.wikimedia.org/r/879111 (owner: BBlack)
[18:28:13] <bblack>	 !log repool eqsin edge DC
[18:28:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:38] <wikibugs>	 (CR) Ottomata: [C: +2] flink-app-example - use correct patch to kubeconfig file in stream-enricnment-poc [deployment-charts] - https://gerrit.wikimedia.org/r/879116 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[18:30:04] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[18:30:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[18:31:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P43002 and previous config saved to /var/cache/conftool/dbconfig/20230111-183155-marostegui.json
[18:32:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[18:33:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[18:33:54] <icinga-wm>	 PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:33:54] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@5a19b9d]: drop-snapshots: Accept snapshot= partition from any level
[18:35:25] <wikibugs>	 (Merged) jenkins-bot: flink-app-example - use correct patch to kubeconfig file in stream-enricnment-poc [deployment-charts] - https://gerrit.wikimedia.org/r/879116 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[18:36:27] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@5a19b9d]: drop-snapshots: Accept snapshot= partition from any level (duration: 02m 33s)
[18:37:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:40:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 100%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P43003 and previous config saved to /var/cache/conftool/dbconfig/20230111-184051-root.json
[18:42:18] <wikibugs>	 (PS1) Jbond: bgpalerter: 1/2 a day to track down a missing 's' :@ [puppet] - https://gerrit.wikimedia.org/r/879119
[18:42:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:43:12] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39096/console" [puppet] - https://gerrit.wikimedia.org/r/879119 (owner: Jbond)
[18:45:04] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] bgpalerter: 1/2 a day to track down a missing 's' :@ [puppet] - https://gerrit.wikimedia.org/r/879119 (owner: Jbond)
[18:47:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321391)', diff saved to https://phabricator.wikimedia.org/P43004 and previous config saved to /var/cache/conftool/dbconfig/20230111-184701-marostegui.json
[18:47:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance
[18:47:06] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[18:47:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance
[18:47:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T321391)', diff saved to https://phabricator.wikimedia.org/P43005 and previous config saved to /var/cache/conftool/dbconfig/20230111-184723-marostegui.json
[18:47:53] <marostegui>	 !log dbmaint deploy schema change with replication on s2 eqiad T321391 
[18:47:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321391)', diff saved to https://phabricator.wikimedia.org/P43006 and previous config saved to /var/cache/conftool/dbconfig/20230111-184946-marostegui.json
[18:51:20] <wikibugs>	 (CR) Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (1 comment) [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[18:52:20] <brett>	 !log Removing legacy vips from dns servers - T239993
[18:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:23] <stashbot>	 T239993: Decom LVS recdns - https://phabricator.wikimedia.org/T239993
[18:52:27] <wikibugs>	 (PS1) Jdlrobson: Enable page tools on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/879121
[18:52:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:54:34] <wikibugs>	 (CR) Dzahn: "Majavah, I think you were technically a contributor in git log, if you agree then this is 100%" [puppet] - https://gerrit.wikimedia.org/r/878205 (owner: Dzahn)
[18:56:13] <wikibugs>	 (CR) Majavah: [C: +1] "don't remember what I did here, but sure" [puppet] - https://gerrit.wikimedia.org/r/878205 (owner: Dzahn)
[18:57:31] <marostegui>	 !log dbmaint deploy schema change with replication on s3 eqiad T321391 
[18:57:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:35] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[18:57:38] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[18:59:00] <wikibugs>	 (CR) Dzahn: [C: +2] "see https://gerrit.wikimedia.org/r/c/operations/puppet/+/815290  f.e. , thanks" [puppet] - https://gerrit.wikimedia.org/r/878205 (owner: Dzahn)
[19:00:05] <jouncebot>	 jeena and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1900).
[19:00:05] <jouncebot>	 jeena and dduvall: #bothumor I ♥ Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1900).
[19:00:29] <mutante>	 brett: multi-merge on puppetmaster, but mine is "add license headers" and yours is just "slightly" more risky with "remove VIP from DNS server".. so it's all yours :o
[19:00:48] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:00:54] <brett>	 got it, thanks!
[19:01:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 1%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43007 and previous config saved to /var/cache/conftool/dbconfig/20230111-190111-root.json
[19:04:47] <wikibugs>	 (PS1) JMeybohm: admin_ng RBAC: Permit deploy users to interact with more resources [deployment-charts] - https://gerrit.wikimedia.org/r/879122 (https://phabricator.wikimedia.org/T324576)
[19:04:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P43008 and previous config saved to /var/cache/conftool/dbconfig/20230111-190453-marostegui.json
[19:06:10] <wikibugs>	 (CR) JMeybohm: "If this works, we should probably if-guard the other CRD permissions as well" [deployment-charts] - https://gerrit.wikimedia.org/r/879122 (https://phabricator.wikimedia.org/T324576) (owner: JMeybohm)
[19:07:23] <wikibugs>	 (CR) Dzahn: "@Papaul, Jennifer is now twice in the admin module, can you please remove her from the "ldap_only" section" [puppet] - https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: Papaul)
[19:10:04] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:10:18] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:52] <jeena>	 train is blocked, will resume after https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/879094/ has had QA and backport
[19:11:34] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:12:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:13:17] <wikibugs>	 (CR) JMeybohm: [C: +2] admin_ng RBAC: Permit deploy users to interact with more resources [deployment-charts] - https://gerrit.wikimedia.org/r/879122 (https://phabricator.wikimedia.org/T324576) (owner: JMeybohm)
[19:15:53] <wikibugs>	 (PS1) Dzahn: librenms: assert data type for globally reserved UID [puppet] - https://gerrit.wikimedia.org/r/879123
[19:16:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43009 and previous config saved to /var/cache/conftool/dbconfig/20230111-191616-root.json
[19:17:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:17:58] <wikibugs>	 (Merged) jenkins-bot: admin_ng RBAC: Permit deploy users to interact with more resources [deployment-charts] - https://gerrit.wikimedia.org/r/879122 (https://phabricator.wikimedia.org/T324576) (owner: JMeybohm)
[19:19:18] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[19:19:28] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[19:20:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P43010 and previous config saved to /var/cache/conftool/dbconfig/20230111-192000-marostegui.json
[19:20:44] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply
[19:20:50] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply
[19:24:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:24:48] <icinga-wm>	 PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[19:27:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:29:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:30:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[19:31:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43011 and previous config saved to /var/cache/conftool/dbconfig/20230111-193121-root.json
[19:32:23] <wikibugs>	 (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/output/879123/39097/netmon1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/879123 (owner: Dzahn)
[19:32:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:35:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321391)', diff saved to https://phabricator.wikimedia.org/P43012 and previous config saved to /var/cache/conftool/dbconfig/20230111-193506-marostegui.json
[19:35:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[19:35:11] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[19:35:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[19:35:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance
[19:35:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance
[19:36:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T321391)', diff saved to https://phabricator.wikimedia.org/P43013 and previous config saved to /var/cache/conftool/dbconfig/20230111-193601-marostegui.json
[19:37:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:38:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321391)', diff saved to https://phabricator.wikimedia.org/P43014 and previous config saved to /var/cache/conftool/dbconfig/20230111-193825-marostegui.json
[19:38:52] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM. Thank you!" [puppet] - https://gerrit.wikimedia.org/r/879123 (owner: Dzahn)
[19:39:33] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] librenms: assert data type for globally reserved UID [puppet] - https://gerrit.wikimedia.org/r/879123 (owner: Dzahn)
[19:41:56] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "noop on netmon1003" [puppet] - https://gerrit.wikimedia.org/r/879123 (owner: Dzahn)
[19:42:42] <wikibugs>	 Puppet, SRE, Data-Services, Infrastructure-Foundations, cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (Andrew) via elimination I've convinced myself that the issue here is 10_dumps_rsyncd :     ` # Autogener...
[19:45:40] <wikibugs>	 (CR) Dzahn: "meanwhile I have mailed the affcom list and they confirmed they are working on it - on hold" [dns] - https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: Dzahn)
[19:46:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43015 and previous config saved to /var/cache/conftool/dbconfig/20230111-194626-root.json
[19:51:04] <wikibugs>	 (PS3) Dzahn: phabricator: rewrite https://phabricator.wikimedia.org/r/ to gerrit [puppet] - https://gerrit.wikimedia.org/r/869853 (https://phabricator.wikimedia.org/T324311)
[19:52:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:53:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P43016 and previous config saved to /var/cache/conftool/dbconfig/20230111-195332-marostegui.json
[20:01:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43017 and previous config saved to /var/cache/conftool/dbconfig/20230111-200131-root.json
[20:02:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:08:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P43018 and previous config saved to /var/cache/conftool/dbconfig/20230111-200838-marostegui.json
[20:12:24] <wikibugs>	 (PS1) Bartosz Dziewoński: Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879099 (https://phabricator.wikimedia.org/T301063)
[20:12:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:16:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43019 and previous config saved to /var/cache/conftool/dbconfig/20230111-201636-root.json
[20:17:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:18:23] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (BCornwall) Open→Resolved @ayounsi Thanks for the detailed explanation and the help! I've removed the legacy_vip stuff from puppet, rolled it out, and deleted the IPs from the servers. I've als...
[20:18:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:19:02] <wikibugs>	 Puppet, SRE, Data-Services, Infrastructure-Foundations, cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (Andrew) The troublesome entries are:  `  ftp.acc.umu.se mirror.accum.se ftp.acc.umu.se mirror.accum.se `...
[20:20:03] <wikibugs>	 (CR) Bartosz Dziewoński: "(There was a merge conflict because https://gerrit.wikimedia.org/r/c/mediawiki/core/+/876270 isn't present in wmf.17)" [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: Bartosz Dziewoński)
[20:22:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:23:14] <zabe>	 jeena, are you currently busy or may I slide in a config change?
[20:23:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321391)', diff saved to https://phabricator.wikimedia.org/P43020 and previous config saved to /var/cache/conftool/dbconfig/20230111-202345-marostegui.json
[20:23:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance
[20:23:49] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[20:23:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:23:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (RobH)
[20:24:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance
[20:24:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T321391)', diff saved to https://phabricator.wikimedia.org/P43021 and previous config saved to /var/cache/conftool/dbconfig/20230111-202417-marostegui.json
[20:26:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321391)', diff saved to https://phabricator.wikimedia.org/P43022 and previous config saved to /var/cache/conftool/dbconfig/20230111-202641-marostegui.json
[20:27:42] <wikibugs>	 Puppet, SRE, Data-Services, Infrastructure-Foundations, cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (Andrew) I don't see any real problem with those hosts other than that they're duplicates of each other....
[20:31:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43023 and previous config saved to /var/cache/conftool/dbconfig/20230111-203141-root.json
[20:32:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:36:12] <wikibugs>	 (CR) CI reject: [V: -1] Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: Bartosz Dziewoński)
[20:37:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (Papaul) Fixed
[20:37:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:39:17] <wikibugs>	 (PS1) Bartosz Dziewoński: Fix phan error when Excimer is enabled [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879100
[20:39:37] <wikibugs>	 (PS3) Bartosz Dziewoński: Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063)
[20:39:57] <wikibugs>	 (CR) Bartosz Dziewoński: "(New test failure is unrelated to the change, will be fixed by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/879100)" [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: Bartosz Dziewoński)
[20:40:40] <wikibugs>	 (PS1) Papaul: Remove Jennifer from the ldap group [puppet] - https://gerrit.wikimedia.org/r/879131 (https://phabricator.wikimedia.org/T326649)
[20:41:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P43024 and previous config saved to /var/cache/conftool/dbconfig/20230111-204147-marostegui.json
[20:47:01] <wikibugs>	 (CR) Marostegui: [C: +1] Remove Jennifer from the ldap group [puppet] - https://gerrit.wikimedia.org/r/879131 (https://phabricator.wikimedia.org/T326649) (owner: Papaul)
[20:47:45] <wikibugs>	 (CR) Marostegui: [C: +2] Remove Jennifer from the ldap group [puppet] - https://gerrit.wikimedia.org/r/879131 (https://phabricator.wikimedia.org/T326649) (owner: Papaul)
[20:47:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:50:38] <wikibugs>	 Puppet, SRE, Data-Services, Infrastructure-Foundations, cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (Dzahn) @Andrew Is it not maybe 65.19.157.35 ?  Because that is the only IP in there and it fails to reso...
[20:52:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:56:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P43025 and previous config saved to /var/cache/conftool/dbconfig/20230111-205654-marostegui.json
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T2100).
[21:00:05] <jouncebot>	 jan_drewniak and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:17] <kindrobot>	 I can deploy
[21:01:49] <MatmaRex>	 hi
[21:01:56] <jan_drewniak>	 o/
[21:02:24] <kindrobot>	 Great. Give me just a moment. We'll start with jan_drewniak 
[21:02:33] <MatmaRex>	 jan_drewniak: do you want to backport to wmf.17 too, or  only wmf.18? i see you have a backport patch but it's not listed on the calendar
[21:03:34] <jan_drewniak>	 MatmaRex: yeah I made two but I think only wmf.18 is required
[21:05:47] <kindrobot>	 jan_drewniak: I'm going to sync both of yours at the same time since one is going to wmf.18 and the other to beta.
[21:06:05] <kindrobot>	 !log start UTC late backport window
[21:06:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:07] <jan_drewniak>	 that's fine with me
[21:07:29] <kindrobot>	 OK, starting merge.
[21:07:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:07:50] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879094 (https://phabricator.wikimedia.org/T326682) (owner: Jdrewniak)
[21:07:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/879121 (owner: Jdlrobson)
[21:08:03] <wikibugs>	 (PS6) BCornwall: varnish: Template out thread pool settings [puppet] - https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723)
[21:08:32] <wikibugs>	 (Merged) jenkins-bot: Enable page tools on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/879121 (owner: Jdlrobson)
[21:09:16] <wikibugs>	 (CR) BCornwall: varnish: Template out thread pool settings (3 comments) [puppet] - https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[21:09:29] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39098/console" [puppet] - https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[21:12:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321391)', diff saved to https://phabricator.wikimedia.org/P43027 and previous config saved to /var/cache/conftool/dbconfig/20230111-211200-marostegui.json
[21:12:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance
[21:12:05] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[21:12:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance
[21:12:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T321391)', diff saved to https://phabricator.wikimedia.org/P43028 and previous config saved to /var/cache/conftool/dbconfig/20230111-211222-marostegui.json
[21:12:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:13:19] <wikibugs>	 (CR) Dzahn: [C: +1] Remove Jennifer from the ldap group [puppet] - https://gerrit.wikimedia.org/r/879131 (https://phabricator.wikimedia.org/T326649) (owner: Papaul)
[21:14:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321391)', diff saved to https://phabricator.wikimedia.org/P43029 and previous config saved to /var/cache/conftool/dbconfig/20230111-211445-marostegui.json
[21:15:57] <wikibugs>	 (PS1) Dzahn: phabricator: add test for /r/ redirect to gerrit [puppet] - https://gerrit.wikimedia.org/r/879137 (https://phabricator.wikimedia.org/T324311)
[21:17:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:22:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:23:17] <wikibugs>	 (Merged) jenkins-bot: Fix mustache template rendering when TOC is rerendered after an edit [skins/Vector] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879094 (https://phabricator.wikimedia.org/T326682) (owner: Jdrewniak)
[21:23:43] <logmsgbot>	 !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:879094|Fix mustache template rendering when TOC is rerendered after an edit (T326682)]], [[gerrit:879121|Enable page tools on beta cluster]]
[21:23:47] <stashbot>	 T326682: [Regression, production] Vector 2022 TOC disappears, becomes "undefined" after saving an edit with DiscussionTools, VisualEditor - https://phabricator.wikimedia.org/T326682
[21:25:27] <logmsgbot>	 !log kindrobot@deploy1002 kindrobot and jdrewniak and jdlrobson: Backport for [[gerrit:879094|Fix mustache template rendering when TOC is rerendered after an edit (T326682)]], [[gerrit:879121|Enable page tools on beta cluster]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[21:25:53] <kindrobot>	 jan_drewniak: could you please confirm your wmf.18 patch?
[21:27:22] <jan_drewniak>	 kindrobot: yup, looks good!
[21:27:47] <kindrobot>	 Great. Syncing.
[21:28:22] <MatmaRex>	 kindrobot: considering that the process seems to be taking a long time today, how about doing my backports all at once?
[21:28:54] <kindrobot>	 That's fine with me.
[21:29:46] <MatmaRex>	 note that there's a dependency between the two wmf.17 patches, but i think that's fine
[21:29:48] <MatmaRex>	 thanks
[21:29:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P43030 and previous config saved to /var/cache/conftool/dbconfig/20230111-212952-marostegui.json
[21:30:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 129 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:32:46] <kindrobot>	 Will you be able to test out your wmf.17 patches as opposed to your wmf.18 patches on the test servers if they're deployed together?
[21:33:04] <kindrobot>	 MatmaRex ^
[21:33:38] <MatmaRex>	 yeah
[21:33:58] <MatmaRex>	 we have wikis on both .17 and .18, right?
[21:34:01] <logmsgbot>	 !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:879094|Fix mustache template rendering when TOC is rerendered after an edit (T326682)]], [[gerrit:879121|Enable page tools on beta cluster]] (duration: 10m 17s)
[21:34:04] <stashbot>	 T326682: [Regression, production] Vector 2022 TOC disappears, becomes "undefined" after saving an edit with DiscussionTools, VisualEditor - https://phabricator.wikimedia.org/T326682
[21:34:13] <jeena>	 group0 is on .18
[21:34:24] <kindrobot>	 Ah, OK.
[21:34:31] <MatmaRex>	 i don't need to make any edits to test these, so i can just test on wikipedias
[21:35:39] <kindrobot>	 OK, MatmaRex. I'm getting ready to start your merges.
[21:37:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:38:20] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 133 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:38:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/878154 (owner: Bartosz Dziewoński)
[21:38:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879100 (owner: Bartosz Dziewoński)
[21:38:38] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: Bartosz Dziewoński)
[21:38:41] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879099 (https://phabricator.wikimedia.org/T301063) (owner: Bartosz Dziewoński)
[21:39:06] <jeena>	 btw you should be able to just provide all the change numbers to scap backport for merging/deploy
[21:39:35] <jeena>	 oh as you did :P
[21:40:08] <kindrobot>	 :)
[21:44:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P43031 and previous config saved to /var/cache/conftool/dbconfig/20230111-214458-marostegui.json
[21:52:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:57:12] <wikibugs>	 (Merged) jenkins-bot: Fix exception in `<gallery mode="slideshow">` with missing images [core] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/878154 (owner: Bartosz Dziewoński)
[21:57:18] <wikibugs>	 (Merged) jenkins-bot: Fix phan error when Excimer is enabled [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879100 (owner: Bartosz Dziewoński)
[21:57:27] <wikibugs>	 (Merged) jenkins-bot: Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: Bartosz Dziewoński)
[21:57:31] <wikibugs>	 (Merged) jenkins-bot: Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879099 (https://phabricator.wikimedia.org/T301063) (owner: Bartosz Dziewoński)
[21:58:01] <logmsgbot>	 !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:878154|Fix exception in `<gallery mode="slideshow">` with missing images]], [[gerrit:879100|Fix phan error when Excimer is enabled]], [[gerrit:879098|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T301063 T326399)]], [[gerrit:879099|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T301063
[21:58:01] <logmsgbot>	 T326399)]]
[21:58:05] <stashbot>	 T301063: The "tag name" on the change line should link directly to "tagged changes" - https://phabricator.wikimedia.org/T301063
[21:58:06] <stashbot>	 T326399: (other edits) links repetitive and long - https://phabricator.wikimedia.org/T326399
[21:58:12] <MatmaRex>	 i can test things whenever they're on mwdebug
[21:58:42] <kindrobot>	 Great. Should be soon. I'll ping you.
[22:00:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321391)', diff saved to https://phabricator.wikimedia.org/P43033 and previous config saved to /var/cache/conftool/dbconfig/20230111-220005-marostegui.json
[22:00:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[22:00:10] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[22:00:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[22:00:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43034 and previous config saved to /var/cache/conftool/dbconfig/20230111-220026-marostegui.json
[22:01:56] <wikibugs>	 (Abandoned) Bartosz Dziewoński: Fix mustache template rendering when TOC is rerendered after an edit [skins/Vector] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/879093 (https://phabricator.wikimedia.org/T326682) (owner: Jdrewniak)
[22:02:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43035 and previous config saved to /var/cache/conftool/dbconfig/20230111-220251-marostegui.json
[22:03:35] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) Alright, finally getting back to this.  So the request is that the group "deployment", which is already on the canary_appserver role on mwdebug hosts...
[22:07:28] <kindrobot>	 Sorry it's taking so long. Not sure what's holding it up.
[22:08:12] <jeena>	 what's the last output you got?
[22:10:43] <kindrobot>	 jeena: sorry my tmux session got weird
[22:11:03] <kindrobot>	 It's actually making more progress.
[22:11:08] <jeena>	 oh good
[22:11:19] <kindrobot>	 It was on K8s image build/push
[22:11:29] <kindrobot>	 Now it's on sync-masters
[22:11:32] <jeena>	 ah yeah that can take a while
[22:12:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:13:01] <dancy>	 Does this link work for yall: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.7.0-5-2023.02?id=WyXhooUBPP0fLdos6gAX
[22:13:46] <wikibugs>	 (PS1) Dzahn: admin/canary_appserver: add group of users allowed to disable puppet [puppet] - https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979)
[22:13:46] <jeena>	 zabe: sorry I did not see your message until now for some reason. If you still want to add your config change after backports are done and before I deploy today that would be fine with me
[22:14:02] <kindrobot>	 Yes, I can see it dancy.
[22:14:03] <zabe>	 yeah that would be cool
[22:14:11] <wikibugs>	 (CR) CI reject: [V: -1] admin/canary_appserver: add group of users allowed to disable puppet [puppet] - https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: Dzahn)
[22:14:20] <zabe>	 dancy, works for me
[22:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:15:42] <dancy>	 thx
[22:16:59] <wikibugs>	 (PS2) Dzahn: admin/canary_appserver: add group of users allowed to disable puppet [puppet] - https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979)
[22:17:18] <dancy>	 https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.7.0-5-2023.02?id=MyHYooUBPP0fLdoscyIC
[22:17:33] <dancy>	 476 l10n files rebuilt.
[22:17:37] <wikibugs>	 (CR) CI reject: [V: -1] admin/canary_appserver: add group of users allowed to disable puppet [puppet] - https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: Dzahn)
[22:17:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P43036 and previous config saved to /var/cache/conftool/dbconfig/20230111-221757-marostegui.json
[22:17:59] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) @daniel This was for you, remember that?
[22:18:54] <dancy>	 Looks like it was a fresh checkout of wmf.17?  Jeena does that track?
[22:19:14] <jeena>	 oh? that seems weird
[22:19:16] <dancy>	 oh I may be misinterpreting a message.. disregard.
[22:19:29] <dancy>	 ah yes, it always says "successfully checked out"  nvm.
[22:19:36] <jeena>	 phew lol
[22:19:39] <wikibugs>	 (PS3) Dzahn: admin/canary_appserver: add group of users allowed to disable puppet [puppet] - https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979)
[22:21:24] <wikibugs>	 (PS1) Zabe: Start writing to rev_comment_id on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954)
[22:21:37] <logmsgbot>	 !log kindrobot@deploy1002 kindrobot and matmarex: Backport for [[gerrit:878154|Fix exception in `<gallery mode="slideshow">` with missing images]], [[gerrit:879100|Fix phan error when Excimer is enabled]], [[gerrit:879098|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T301063 T326399)]], [[gerrit:879099|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view
[22:21:38] <logmsgbot>	 " (T301063 T326399)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[22:21:42] <stashbot>	 T301063: The "tag name" on the change line should link directly to "tagged changes" - https://phabricator.wikimedia.org/T301063
[22:21:42] <stashbot>	 T326399: (other edits) links repetitive and long - https://phabricator.wikimedia.org/T326399
[22:22:07] <kindrobot>	 MatmaRex: we made it! Could you please confirm?
[22:22:14] <mutante>	 hey deployers, have you ever thought "I wish I could temp disable puppet on mwdebug" ?
[22:22:21] <MatmaRex>	 looking
[22:22:22] <mutante>	 I know I have been asked about it
[22:22:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:23:01] <MatmaRex>	 kindrobot: everything looks good
[22:23:34] <kindrobot>	 OK great. Syncing.
[22:23:59] <dancy>	 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/879098 changes l10n files.. so it's when that was backported that I would expect to see the l10n rebuild happen.
[22:24:05] <jeena>	 mutante: are you making that happen? :P personally I have not considered it but I am probably an outlier
[22:24:30] <mutante>	 jeena: yes:) I am trying to make that happen. https://gerrit.wikimedia.org/r/c/operations/puppet/+/879147
[22:24:42] <mutante>	 a ticket that's been sitting there for a while
[22:25:20] <jeena>	 cool!
[22:26:32] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn)
[22:27:49] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn)
[22:27:58] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) >>! In T305979#7976119, @MoritzMuehlenhoff wrote: > This was discussed in the Infrastructure Foundation...
[22:28:02] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) a:Dzahn→None
[22:29:43] <wikibugs>	 (CR) Dzahn: "Alex, Effie, I almost forgot entirely about this. Does it make sense to keep it open or is this one of those cases where thumbor is moving" [puppet] - https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: Dzahn)
[22:31:10] <wikibugs>	 SRE, Icinga, Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (lmata)
[22:31:38] <wikibugs>	 (CR) Dzahn: phabricator: change phd home dir to /var/lib/phd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[22:32:46] <icinga-wm>	 PROBLEM - Check systemd state on people2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:32:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:33:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P43037 and previous config saved to /var/cache/conftool/dbconfig/20230111-223304-marostegui.json
[22:35:08] <kindrobot>	 How did this person make this video? Is the camera strapped to their head?
[22:37:33] <kindrobot>	 Ooops, wrong channel x_x;;
[22:37:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:38:07] <logmsgbot>	 !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:878154|Fix exception in `<gallery mode="slideshow">` with missing images]], [[gerrit:879100|Fix phan error when Excimer is enabled]], [[gerrit:879098|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T301063 T326399)]], [[gerrit:879099|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T30106
[22:38:07] <logmsgbot>	 3 T326399)]] (duration: 40m 05s)
[22:38:09] <kindrobot>	 I'm not watching youtube while the deploy is finishing. ;)
[22:38:13] <stashbot>	 T301063: The "tag name" on the change line should link directly to "tagged changes" - https://phabricator.wikimedia.org/T301063
[22:38:13] <stashbot>	 T326399: (other edits) links repetitive and long - https://phabricator.wikimedia.org/T326399
[22:38:16] <stashbot>	 T30106: Problem with port setting when web server is behind NAT - https://phabricator.wikimedia.org/T30106
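(Editor's note: the T30106 lookup above looks unrelated to this deploy. The "Finished scap" message was split across two lines right after "(T30106", with the trailing "3" landing on the next line, so a per-line task-ID match would see T30106 rather than T301063. A minimal sketch of that effect, assuming a simple regex matcher; the pattern and the `task_ids` helper are hypothetical and are not stashbot's actual code.)

```python
import re

# Hypothetical matcher: "T" followed by digits, delimited by word boundaries.
TASK_ID = re.compile(r"\bT\d+\b")


def task_ids(lines):
    """Collect task IDs from each line independently, as a per-line bot would."""
    found = []
    for line in lines:
        found.extend(TASK_ID.findall(line))
    return found


message = 'Revert "ChangeTags" (T301063 T326399)'

# Matching on the whole message finds the intended IDs.
whole = task_ids([message])

# Cut the message one character before the end of the first ID,
# mimicking a line-length split that ignores token boundaries.
cut = message.index("T301063") + len("T30106")
parts = [message[:cut], message[cut:]]
split = task_ids(parts)

print(whole)
print(split)
```

(Rejoining continuation lines before matching, or splitting only at whitespace, would avoid the truncated ID in this sketch.)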
[22:38:57] <kindrobot>	 Speaking of, the deploy just finished. Thank you jan_drewniak and MatmaRex. Sorry that took so long.
[22:39:07] <wikibugs>	 (PS5) Dzahn: phabricator: change phd home dir to /var/lib/phd [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[22:39:10] <MatmaRex>	 thanks!
[22:39:10] <kindrobot>	 !log close UTC late backport window
[22:39:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:27] <wikibugs>	 (CR) CI reject: [V: -1] phabricator: change phd home dir to /var/lib/phd [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[22:39:35] <wikibugs>	 (CR) Dzahn: phabricator: change phd home dir to /var/lib/phd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[22:40:01] <zabe>	 jeena, I would quickly push through my config changes if that is still fine with this
[22:40:18] <jeena>	 Yup, lmk if you need anything from me zabe
[22:40:28] <wikibugs>	 (CR) Zabe: [C: +2] Start reading from cuc_actor on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879055 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[22:40:36] <wikibugs>	 (PS2) Zabe: Start writing to rev_comment_id on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954)
[22:40:41] <wikibugs>	 (CR) Zabe: [C: +2] Start writing to rev_comment_id on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[22:40:43] <effie>	 !log upload memkeys_20181031-2~bullseye0_ on bullseye-wikimedia
[22:40:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:48] <wikibugs>	 (PS2) Zabe: Stop writing to cul_user and cul_user_text on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004)
[22:40:53] <wikibugs>	 (CR) Zabe: [C: +2] Stop writing to cul_user and cul_user_text on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[22:41:14] <wikibugs>	 (Merged) jenkins-bot: Start reading from cuc_actor on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879055 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[22:41:27] <wikibugs>	 (PS6) Dzahn: phabricator: change phd home dir to /var/lib/phd [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[22:41:29] <wikibugs>	 (Merged) jenkins-bot: Start writing to rev_comment_id on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[22:41:31] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[22:41:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[22:41:35] <wikibugs>	 (Merged) jenkins-bot: Stop writing to cul_user and cul_user_text on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[22:41:52] <wikibugs>	 (CR) CI reject: [V: -1] phabricator: change phd home dir to /var/lib/phd [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[22:42:12] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:879055|Start reading from cuc_actor on group0 and group1 wikis (T233004)]], [[gerrit:879148|Start writing to rev_comment_id on group0 wikis (T299954)]], [[gerrit:879057|Stop writing to cul_user and cul_user_text on testwiki (T233004)]]
[22:42:17] <stashbot>	 T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[22:42:18] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[22:43:58] <logmsgbot>	 !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:879055|Start reading from cuc_actor on group0 and group1 wikis (T233004)]], [[gerrit:879148|Start writing to rev_comment_id on group0 wikis (T299954)]], [[gerrit:879057|Stop writing to cul_user and cul_user_text on testwiki (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[22:44:06] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:47:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:48:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43038 and previous config saved to /var/cache/conftool/dbconfig/20230111-224810-marostegui.json
[22:48:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance
[22:48:15] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[22:48:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance
[22:48:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43039 and previous config saved to /var/cache/conftool/dbconfig/20230111-224832-marostegui.json
[22:50:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43040 and previous config saved to /var/cache/conftool/dbconfig/20230111-225056-marostegui.json
[22:51:40] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:879055|Start reading from cuc_actor on group0 and group1 wikis (T233004)]], [[gerrit:879148|Start writing to rev_comment_id on group0 wikis (T299954)]], [[gerrit:879057|Stop writing to cul_user and cul_user_text on testwiki (T233004)]] (duration: 09m 28s)
[22:51:44] <stashbot>	 T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[22:51:44] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[22:52:13] <zabe>	 jeena, over to you
[22:52:23] <jeena>	 Thanks zabe
[22:52:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:56:08] <wikibugs>	 (PS7) Dzahn: phabricator: change phd home dir to /var/lib/phd [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[23:02:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:05:31] <jeena>	 jouncebot: now
[23:05:31] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 54 minute(s)
[23:06:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P43041 and previous config saved to /var/cache/conftool/dbconfig/20230111-230603-marostegui.json
[23:07:14] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q3:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (RobH)
[23:07:37] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q3:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (RobH)
[23:07:51] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/879152 (https://phabricator.wikimedia.org/T325581)
[23:07:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/879152 (https://phabricator.wikimedia.org/T325581) (owner: TrainBranchBot)
[23:08:33] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/879152 (https://phabricator.wikimedia.org/T325581) (owner: TrainBranchBot)
[23:15:55] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.18  refs T325581
[23:15:58] <stashbot>	 T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581
[23:21:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P43042 and previous config saved to /var/cache/conftool/dbconfig/20230111-232109-marostegui.json
[23:21:17] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (RobH)
[23:22:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:22:52] <logmsgbot>	 !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.18  refs T325581 (duration: 06m 57s)
[23:22:56] <stashbot>	 T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581
[23:32:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:36:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43043 and previous config saved to /var/cache/conftool/dbconfig/20230111-233616-marostegui.json
[23:36:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance
[23:36:21] <stashbot>	 T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391
[23:36:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance
[23:36:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance
[23:36:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance
[23:36:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T321391)', diff saved to https://phabricator.wikimedia.org/P43044 and previous config saved to /var/cache/conftool/dbconfig/20230111-233652-marostegui.json
[23:37:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:39:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T321391)', diff saved to https://phabricator.wikimedia.org/P43045 and previous config saved to /var/cache/conftool/dbconfig/20230111-233916-marostegui.json
[23:47:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:52:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:53:42] <wikibugs>	 (PS1) Bartosz Dziewoński: Track callers of parseRevisionParsoidHtml. [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879101
[23:54:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P43047 and previous config saved to /var/cache/conftool/dbconfig/20230111-235423-marostegui.json