[00:01:19] sirenbot: do your thing
[00:34:10] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:52] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd0d52d5280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[00:34:52] org/wiki/Search%23Administration
[00:35:44] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:28] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 667, active_shards: 1509, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[00:36:28] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:46:20] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:09:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:37:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:46] (JobUnavailable) firing: (7) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:20] (CR) Eevans: [C: +1] swift: add swift::rclone [puppet] - https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[01:50:52] (CR) Eevans: [C: +1] hiera: move swift credentials into common [labs/private] - https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[01:57:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:59:26] (CR) Eevans: [C: +1] swift: move accounts_keys to common hiera global_account_keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[02:07:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[02:17:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:19:37] (PS1) Bartosz Dziewoński: Fix exception in `` with missing images [core] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/878154
[02:22:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:56] (CR) Cwhite: [C: +1] kafka-logging: add kafka-logging200[45] to codfw cluster [puppet] - https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) (owner: Herron)
[02:46:32] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 199 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:48:08] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:57:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:47:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:52:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:55:17] SRE, SRE-OnFire, Product-Infrastructure-Team-Backlog, Maps (Kartotherian), Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (lmata) Open→Resolved a:lmata >>! In T266807#8495179, @akosiaris wrote: > Thi...
[04:13:30] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:16:36] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:21:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:28:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:37:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:37:54] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:47:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:48:26] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet
[05:48:28] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet
[05:52:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:54:31] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet
[05:55:07] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet
[06:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:33:17] SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (Joe) After collecting some correct data, and discussing the matter with @Krinkle , we don't think we have a strict need for onhost memcached at the moment if not for releivin...
[07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T0700)
[07:09:53] (CR) Ayounsi: P:environment: Add ablilty to inject environment variables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond)
[07:16:23] SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) From JTAC: > This message “Read-only file system” suggest file system issues. I found one case with same behavior and the upgrade had to do it with...
[07:17:45] SRE, ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (ayounsi) Open→Resolved All good, thanks a lot!
[07:43:54] PROBLEM - SSH on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:51:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:56:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:57:20] PROBLEM - Check systemd state on wdqs2010 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:36] (PS1) JMeybohm: cert-manager: Allow leader election to write configmaps [deployment-charts] - https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292)
[07:58:38] (PS1) JMeybohm: cert-manager: Set leader election namespace to cert-manager [deployment-charts] - https://gerrit.wikimedia.org/r/878752 (https://phabricator.wikimedia.org/T325292)
[08:00:04] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T0800). Please do the needful.
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:02:06] Looks like I didn't put patch in proper window :D
[08:02:49] Oh, didn't put my name in the ircnick template :/
[08:03:24] I'll go ahead with only patch I've for backport.
[08:04:20] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/877223 (https://phabricator.wikimedia.org/T326278) (owner: KartikMistry)
[08:06:20] PROBLEM - SSH on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:08:22] PROBLEM - Check systemd state on wdqs2010 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:17] (CR) Ayounsi: O:rpkivalidator: add bgpalerter to rpki servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: Jbond)
[08:11:34] PROBLEM - Check systemd state on wdqs2011 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:08] (CR) Muehlenhoff: admin: Add Jennifer Hancock to the datacenter-ops group (2 comments) [puppet] - https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: Papaul)
[08:13:46] RECOVERY - SSH on wdqs2010 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:14:04] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/878063 (https://phabricator.wikimedia.org/T326316) (owner: Jbond)
[08:15:14] PROBLEM - SSH on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:18:38] PROBLEM - SSH on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:19:23] (Merged) jenkins-bot: CX: Fix transformation of TranslationUnitDTO to custom array [extensions/ContentTranslation] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/877223 (https://phabricator.wikimedia.org/T326278) (owner: KartikMistry)
[08:19:59] !log kartik@deploy1002 Started scap: Backport for [[gerrit:877223|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]]
[08:20:02] T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278
[08:21:48] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:877223|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:25:17] (CR) Muehlenhoff: [C: +2] Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/878014 (https://phabricator.wikimedia.org/T325387) (owner: Muehlenhoff)
[08:25:30] PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:43] (CR) MVernon: swift: move accounts_keys to common hiera global_account_keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[08:28:52] RECOVERY - Check systemd state on wdqs2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:34] RECOVERY - SSH on wdqs2010 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:30:26] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:31:44] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:877223|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]] (duration: 11m 45s)
[08:31:47] T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278
[08:32:00] SRE, Infrastructure-Foundations: Upgrade ferm to 2.5.1 - https://phabricator.wikimedia.org/T248954 (MoritzMuehlenhoff) Open→Declined We won't update Buster hosts to 2.5.1 anymore, these will only be around for some more months anyway and all energy is better spent on migrating these systems to Bu...
[08:32:28] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:23] (PS1) Ayounsi: depool eqsin for network maintenance [dns] - https://gerrit.wikimedia.org/r/878854 (https://phabricator.wikimedia.org/T316532)
[08:34:24] (CR) MVernon: swift: move accounts_keys to common hiera global_account_keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[08:34:36] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:39] (PS2) Muehlenhoff: os-reports: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/870558
[08:34:43] (CR) Ayounsi: [C: +2] depool eqsin for network maintenance [dns] - https://gerrit.wikimedia.org/r/878854 (https://phabricator.wikimedia.org/T316532) (owner: Ayounsi)
[08:35:02] No more patches in UTC morning backport window.
[08:36:47] (CR) Muehlenhoff: [C: +2] os-reports: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/870558 (owner: Muehlenhoff)
[08:38:18] RECOVERY - Check systemd state on wdqs2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:28] PROBLEM - SSH on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:42:46] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:43:26] RECOVERY - SSH on wdqs2012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:48:22] PROBLEM - SSH on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:48:22] (PS2) Giuseppe Lavagetto: trafficserver: remove support for "php7.4" option in x-w-d [puppet] - https://gerrit.wikimedia.org/r/867541
[08:50:18] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:49] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[08:51:59] (CR) Filippo Giunchedi: [C: +1] hiera: move swift credentials into common [labs/private] - https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[08:52:12] PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:20] (PS4) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[08:53:09] (CR) Filippo Giunchedi: [C: +1] swift: move accounts_keys to common hiera global_account_keys [puppet] - https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[08:53:37] (CR) Jcrespo: [C: +1] "Looks good to me, please deploy at any time." [puppet] - https://gerrit.wikimedia.org/r/878186 (owner: Eevans)
[08:58:05] (CR) Gehel: [C: +2] wdqs: make depool the default behavior [cookbooks] - https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: Ryan Kemper)
[08:58:23] (PS2) Gehel: wdqs: make depool the default behavior [cookbooks] - https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: Ryan Kemper)
[08:59:16] RECOVERY - SSH on wdqs2012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:15:48] RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:00] (PS7) Hashar: Display Zuul status of jobs for a change in Gerrit [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/859127 (https://phabricator.wikimedia.org/T214068)
[09:17:58] (CR) Hashar: [C: +1] httpbb: add SPDX license headers for some test files [puppet] - https://gerrit.wikimedia.org/r/878205 (owner: Dzahn)
[09:23:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:25:53] (CR) Vgutierrez: [C: +1] trafficserver: remove support for "php7.4" option in x-w-d [puppet] - https://gerrit.wikimedia.org/r/867541 (owner: Giuseppe Lavagetto)
[09:26:45] (CR) Muehlenhoff: [C: +2] cassandra: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870847 (owner: Muehlenhoff)
[09:28:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:28:58] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:01] (PS1) Muehlenhoff: package_builder: Also install the hooks for sid [puppet] - https://gerrit.wikimedia.org/r/878856
[09:32:39] (CR) Volans: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/878856 (owner: Muehlenhoff)
[09:33:46] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:36:25] (CR) Muehlenhoff: [C: +2] package_builder: Also install the hooks for sid [puppet] - https://gerrit.wikimedia.org/r/878856 (owner: Muehlenhoff)
[09:41:02] gehel: I think this is something you may know about, but please correct me if it is the wrong team. WDQS has an outdated SPARQ check, should I file a ticket about that?
[09:49:28] !log installing python3.7 security updates
[09:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:32] (PS1) Vgutierrez: prometheus: Fix job:haproxy_frontend_http_responses_total:rate2m [puppet] - https://gerrit.wikimedia.org/r/878858 (https://phabricator.wikimedia.org/T288196)
[09:55:12] (CR) Hashar: opensearch: make upgrade-phatality.sh stricter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: Hashar)
[09:57:12] (PS3) Hashar: opensearch: make upgrade-phatality.sh stricter [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440)
[09:57:20] (CR) Filippo Giunchedi: [C: +1] "LGTM, by convention the metric should be job_code:etcetc (i.e. list the aggregation variables). Though in this case we have already the va" [puppet] - https://gerrit.wikimedia.org/r/878858 (https://phabricator.wikimedia.org/T288196) (owner: Vgutierrez)
[09:57:36] (PS1) Jelto: add Stephane Rebai to ldap/wmf group [puppet] - https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655)
[09:57:39] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/877276 (owner: Dzahn)
[09:58:12] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/877277 (owner: Dzahn)
[09:58:16] (CR) CI reject: [V: -1] add Stephane Rebai to ldap/wmf group [puppet] - https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655) (owner: Jelto)
[09:58:36] (CR) Hashar: opensearch: make upgrade-phatality.sh stricter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: Hashar)
[09:59:42] (PS2) Jelto: add Stephane Rebai to ldap/wmf group [puppet] - https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655)
[10:01:06] (CR) Vgutierrez: [C: +2] prometheus: Fix job:haproxy_frontend_http_responses_total:rate2m [puppet] - https://gerrit.wikimedia.org/r/878858 (https://phabricator.wikimedia.org/T288196) (owner: Vgutierrez)
[10:02:26] !log asw1-eqsin> request system reboot all-members - T316532
[10:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:29] T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532
[10:04:34] (PS24) Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:05:00] PROBLEM - VRRP status on cr3-eqsin is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[10:05:35] expected
[10:05:38] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 10 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:06:27] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39057/console" [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: Jbond)
[10:06:32] (virtual-chassis crash) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[10:07:29] you can ignore that too
[10:07:38] (PS3) Jelto: add Stephane Rebai to ldap/wmf group [puppet] - https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655)
[10:07:56] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:08:39] says codfw, so I guess it's not related ^
[10:08:50] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Connect -
[10:08:50] , AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:08:59] expected ^
[10:09:00] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:09:34] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655) (owner: Jelto)
[10:11:41] (PS25) Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:12:44] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:05] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39058/console" [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: Jbond)
[10:13:18] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:13:55] (CR) Jelto: [C: +2] add Stephane Rebai to ldap/wmf group [puppet] - https://gerrit.wikimedia.org/r/878859 (https://phabricator.wikimedia.org/T326655) (owner: Jelto)
[10:13:57] (PS26) Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:13:59] (PS1) Jbond: bgpalerter: add authorizationHeader and use yml vs yaml extension [puppet] - https://gerrit.wikimedia.org/r/878865
[10:14:08] RECOVERY - VRRP status on cr3-eqsin is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[10:15:02] (PS1) Zabe: Simplify expensive check [extensions/3D] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/878160 (https://phabricator.wikimedia.org/T326690)
[10:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[10:15:20] half of the switch stack came back online fine...
[10:15:26] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39059/console" [puppet] - https://gerrit.wikimedia.org/r/878865 (owner: Jbond)
[10:16:36] !log installing postgresql-11 security updates
[10:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:38] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:17:31] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (Jelto) p:Triage→Medium a:Jelto Thanks for opening the request. I added you to: * ldap/wmf group * wmf-nda phabricator group * to [data.yaml](https://gerrit...
[10:17:32] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 5 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:18:25] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bullseye [10:18:33] (03PS2) 10Zabe: Start reading from cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877249 (https://phabricator.wikimedia.org/T233004) [10:18:48] (03CR) 10Zabe: [C: 03+2] Start reading from cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877249 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [10:18:52] (03CR) 10Zabe: [C: 03+2] Simplify expensive check [extensions/3D] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/878160 (https://phabricator.wikimedia.org/T326690) (owner: 10Zabe) [10:19:11] I'll have to follow up with jtac, something is busted on one of the two switches... [10:19:38] (03Merged) 10jenkins-bot: Start reading from cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877249 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [10:20:51] (03Merged) 10jenkins-bot: Simplify expensive check [extensions/3D] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/878160 (https://phabricator.wikimedia.org/T326690) (owner: 10Zabe) [10:21:24] !log zabe@deploy1002 Started scap: Backport for [[gerrit:878160|Simplify expensive check (T326690)]], [[gerrit:877249|Start reading from cuc_actor on test wikis (T233004)]] [10:21:29] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [10:21:29] T326690: PHP Deprecated: HTMLFormField::__construct: Constructing an HTMLFormField without a 'parent' parameter [Called from Licenses::__construct] - https://phabricator.wikimedia.org/T326690 [10:21:32] (virtual-chassis crash) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from virtual-chassis crash - 
https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [10:23:13] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:878160|Simplify expensive check (T326690)]], [[gerrit:877249|Start reading from cuc_actor on test wikis (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [10:23:39] !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid test cluster: Reboot Druid nodes [10:24:37] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw1486.eqiad.wmnet with reason: hardware troubleshooting [10:25:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw1486.eqiad.wmnet with reason: hardware troubleshooting [10:25:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=edb03633-d9b6-4a06-849d-2c3da0e62688) set by cgoubert@cumin1001 for 7 days,... [10:26:05] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (10Jelto) According to https://www.mediawiki.org/wiki/Gerrit/Privilege_policy you should also have Gerrit +2 from your ldap/wmf membership. So you should have the request... 
[10:26:25] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (10Jelto) a:05Jelto→03StephaneRebai [10:26:43] (03PS1) 10Ilias Sarantopoulos: ml-services: multi-processing changes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 [10:27:14] (03PS2) 10Ilias Sarantopoulos: ml-services: multi-processing changes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) [10:29:02] (03CR) 10Filippo Giunchedi: "Only nits/minor things really" [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [10:29:04] (03CR) 10Jbond: [V: 03+1 C: 03+2] bgpalerter: add authorizationHeader and use yml vs yaml extension [puppet] - 10https://gerrit.wikimedia.org/r/878865 (owner: 10Jbond) [10:30:59] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:878160|Simplify expensive check (T326690)]], [[gerrit:877249|Start reading from cuc_actor on test wikis (T233004)]] (duration: 09m 34s) [10:31:04] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [10:31:04] T326690: PHP Deprecated: HTMLFormField::__construct: Constructing an HTMLFormField without a 'parent' parameter [Called from Licenses::__construct] - https://phabricator.wikimedia.org/T326690 [10:32:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [10:34:21] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [10:36:29] (03PS27) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [10:36:31] (03PS1) 10Jbond: bgpalerter: add defaults for 
bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/878867 [10:36:45] (03CR) 10Jbond: [V: 03+2 C: 03+2] bgpalerter: add defaults for bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/878867 (owner: 10Jbond) [10:37:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [10:37:49] zabe: Thanks for backporting the follow up on the HTMLForm thing :) [10:39:26] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [10:39:28] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) fpc0 went back up fine, but fpc1 not so much... It's not fully booting and stuck at a busybox like shell. Root password works so that means the con... [10:40:37] (03CR) 10Ayounsi: [C: 03+1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [10:41:46] yw [10:43:40] huh, redlinks are broken on mw.org [10:44:22] or maybe just in flow [10:44:26] (03CR) 10Jcrespo: "Could you clarify comments 1 and 2, 3 I will fix right away." 
[puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [10:44:53] (03PS28) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [10:44:55] (03PS1) 10Jbond: bgpalerter: add profile [puppet] - 10https://gerrit.wikimedia.org/r/878868 [10:45:28] (03CR) 10CI reject: [V: 04-1] bgpalerter: add profile [puppet] - 10https://gerrit.wikimedia.org/r/878868 (owner: 10Jbond) [10:45:44] (03PS1) 10Volans: dhcp: fix tests using unnecessary hack [software/spicerack] - 10https://gerrit.wikimedia.org/r/878869 [10:45:58] (03CR) 10Volans: [C: 04-1] "That's actually kinda correct as a report, the there is an error in the tests. I've sent I8adace301ff730e5f311ea233266565946f0d9ae to fix " [software/spicerack] - 10https://gerrit.wikimedia.org/r/878172 (owner: 10Jbond) [10:46:02] (03PS1) 10Zabe: Start reading from cul_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878870 (https://phabricator.wikimedia.org/T233004) [10:47:21] jynus: sorry for the delay. Yes, please create a phab task. How is it outdated? [10:47:41] I just diffed deeper and the check is right, I was confused by it [10:47:45] *digged [10:47:59] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF) [10:48:04] I wonder if it is T323096 and an expired downtime, gehel [10:48:04] T323096: WDQS Data Reload - https://phabricator.wikimedia.org/T323096 [10:48:22] in that case, just a new longer downtime should do the trick [10:48:24] 10SRE, 10Infrastructure-Foundations: Design and implement async LDAP operations - https://phabricator.wikimedia.org/T320427 (10SLyngshede-WMF) 05In progress→03Resolved We're going with Django-RQ as it's simpler and does not require Celery. 
[10:48:39] wdqs is returning 400 on those hosts, hence the error [10:49:12] Oh, might be that data reload. I'll have a look (cc inflatador, ryankemper) [10:49:15] I was about to ask on T323096 [10:49:32] maybe this is not something you are in charge of [10:49:44] (03PS2) 10Jbond: bgpalerter: add profile [puppet] - 10https://gerrit.wikimedia.org/r/878868 [10:50:05] (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_aux-k8s-ingress.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:50:39] (03PS3) 10Jbond: bgpalerter: add profile [puppet] - 10https://gerrit.wikimedia.org/r/878868 [10:50:41] (03PS29) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [10:52:19] jynus: Ryan and Brian are working on that data reload, and it has not been going as planned :/ But I have some knowledge of what's going on. 
[10:52:22] (03CR) 10Jbond: [C: 03+2] bgpalerter: add profile [puppet] - 10https://gerrit.wikimedia.org/r/878868 (owner: 10Jbond) [10:52:29] I'll extend the downtime [10:52:44] oh, I had just commented: https://phabricator.wikimedia.org/T323096#8516090 [10:53:05] you can also ack, which will disable the alerts until they work again so they don't expire, up to you [10:53:32] feel free to comment there if you take action so they don't have to [10:53:56] (03CR) 10Klausman: [C: 03+1] ml-services: multi-processing changes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [10:54:09] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:54:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1001.eqiad.wmnet with OS bullseye [10:55:05] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_eqiad_aux-k8s-ingress.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:55:35] jynus: It's related more to T301167. I've added a week of downtime (cc: inflatador, ryankemper) [10:55:35] T301167: Service implementation for wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T301167 [10:56:00] I see, thank you! [10:56:07] sorry for the ping [10:56:49] so initially I had thought that the string returned was outdated and the check needed changes [10:57:23] but it turned out it was a service returning a 400 code when I dug deeper [10:58:36] (03PS10) 10Slyngshede: role:IDM assign IDM role to test VM. 
[puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [10:59:10] (03PS11) 10Slyngshede: role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1100) [11:00:54] (03CR) 10Jbond: [V: 03+1 C: 03+2] docker::baseimages: inject no_proxy config to rebuild job [puppet] - 10https://gerrit.wikimedia.org/r/878063 (https://phabricator.wikimedia.org/T326316) (owner: 10Jbond) [11:00:56] (03PS1) 10Muehlenhoff: Limit the installed hooks for sid [puppet] - 10https://gerrit.wikimedia.org/r/878871 [11:04:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/878869 (owner: 10Volans) [11:04:25] (03Abandoned) 10Jbond: dhcp: disable no-member check [software/spicerack] - 10https://gerrit.wikimedia.org/r/878172 (owner: 10Jbond) [11:04:30] (03PS12) 10Slyngshede: role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [11:05:24] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39061/console" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [11:06:09] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:18] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] aux_k8s::worker: Include P::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/878872 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [11:09:51] (03PS13) 10Slyngshede: role:IDM assign IDM role to test VM. 
[puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [11:12:41] ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 80 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish [11:12:41] ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3127 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish [11:12:41] ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3126 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish [11:12:41] ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3125 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish [11:12:41] ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3124 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Varnish [11:12:45] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1003.eqiad.wmnet with OS bullseye [11:13:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/878871 (owner: 10Muehlenhoff) [11:14:03] (03PS12) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [11:14:04] ACKNOWLEDGEMENT - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/DNS [11:14:04] ACKNOWLEDGEMENT - SSH on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:14:04] ACKNOWLEDGEMENT - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/NTP [11:14:04] ACKNOWLEDGEMENT - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - query timed out ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/DNS [11:14:04] ACKNOWLEDGEMENT - Host dns5004 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T316532 [11:15:02] !log btullis@cumin1001 END (FAIL) - Cookbook sre.druid.reboot-workers (exit_code=99) for Druid test cluster: Reboot Druid nodes [11:15:44] (03CR) 10Volans: [C: 03+2] dhcp: fix tests using unnecessary hack [software/spicerack] - 10https://gerrit.wikimedia.org/r/878869 (owner: 10Volans) [11:15:59] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [11:15:59] ACKNOWLEDGEMENT - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T316532 [11:16:01] !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) [11:16:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] conftool-data: Add aux-k8s-workers to conftool [puppet] - 10https://gerrit.wikimedia.org/r/878874 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [11:16:19] ACKNOWLEDGEMENT - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/DNS [11:16:46] (03CR) 10Filippo Giunchedi: [C: 03+1] conftool-data: Add aux-k8s-workers to conftool [puppet] - 10https://gerrit.wikimedia.org/r/878874 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [11:17:07] (03CR) 10Clément Goubert: [C: 03+2] conftool-data: Add aux-k8s-workers to conftool [puppet] - 
10https://gerrit.wikimedia.org/r/878874 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [11:18:44] ACKNOWLEDGEMENT - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 5 ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:18:44] ACKNOWLEDGEMENT - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Connect - Anycast ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:44] ACKNOWLEDGEMENT - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 5 ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:18:44] ACKNOWLEDGEMENT - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:50] (03PS1) 10Jbond: bgpalerter: move authorizationHeader to ris section [puppet] - 10https://gerrit.wikimedia.org/r/878875 [11:19:15] (03Merged) 10jenkins-bot: dhcp: fix tests using unnecessary hack [software/spicerack] - 10https://gerrit.wikimedia.org/r/878869 (owner: 10Volans) [11:19:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast3006.wikimedia.org [11:19:26] (03PS3) 10Volans: puppet: allow to specify the exact disabled message [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [11:19:26] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:19:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39063/console" [puppet] - 10https://gerrit.wikimedia.org/r/878875 (owner: 10Jbond) [11:19:59] ACKNOWLEDGEMENT - Juniper virtual chassis ports on asw1-eqsin is CRITICAL: CRIT: Down: 2 Unknown: 0 ayounsi https://phabricator.wikimedia.org/T316532 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [11:20:34] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878871 (owner: 10Muehlenhoff) [11:21:09] (03CR) 10Jbond: [V: 03+1 C: 03+2] bgpalerter: move authorizationHeader to ris section [puppet] - 10https://gerrit.wikimedia.org/r/878875 (owner: 10Jbond) [11:21:19] (03CR) 10Muehlenhoff: [C: 03+2] Limit the installed hooks for sid [puppet] - 10https://gerrit.wikimedia.org/r/878871 (owner: 10Muehlenhoff) [11:21:47] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1038.eqiad.wmnet with OS bullseye [11:22:54] (03CR) 10CI reject: [V: 04-1] puppet: allow to specify the exact disabled message [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [11:22:57] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [11:22:59] !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) [11:23:08] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) We tried to boot on the Recovery Junos (both 14 and 20) but the same error happened. Next step is onsite "format install" https://supportportal.ju... 
[11:24:56] (03CR) 10Volans: "The approach LGTM, couple of nits inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [11:25:49] RECOVERY - Check systemd state on people2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:54] (03CR) 10Filippo Giunchedi: check_legal_terms: Refactor check to make it more robust against changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [11:28:12] (03PS1) 10Jbond: cache::base: move wikimedia and wmcs domains to global level [puppet] - 10https://gerrit.wikimedia.org/r/878876 [11:28:41] RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3006.wikimedia.org - jmm@cumin2002" [11:29:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3006.wikimedia.org - jmm@cumin2002" [11:29:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:29:53] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast3006.wikimedia.org on all recursors [11:29:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39064/console" [puppet] - 10https://gerrit.wikimedia.org/r/878876 (owner: 10Jbond) [11:30:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) bast3006.wikimedia.org on all recursors [11:33:05] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1038.eqiad.wmnet with reason: host reimage [11:36:08] !log jiji@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1038.eqiad.wmnet with reason: host reimage [11:38:40] !log cgoubert@cumin1001 conftool action : set/pooled=yes:weight=10; selector: cluster=aux-k8s,service=kubesvc [11:39:23] (03CR) 10AikoChou: [C: 03+1] "Only one suggestion regarding the commit message :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [11:40:33] (03PS1) 10Jbond: bgpalerter: Pass through RPKI config [puppet] - 10https://gerrit.wikimedia.org/r/878877 [11:40:44] (03CR) 10Muehlenhoff: [C: 03+2] Add SPDX headers to various base/IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/863305 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:40:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] cache::base: move wikimedia and wmcs domains to global level [puppet] - 10https://gerrit.wikimedia.org/r/878876 (owner: 10Jbond) [11:41:17] (03PS14) 10Slyngshede: role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [11:41:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Jclark-ctr) @clements_goubert I checked yesterday afternoon did not see any alerts. 
Let’s repool server close ticket [11:41:35] moritzm: FI ill merge your SPDX changes as well [11:41:40] please do [11:41:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Jclark-ctr) 05In progress→03Resolved [11:41:46] * jbond done [11:41:53] thx [11:41:56] np [11:42:21] (03CR) 10CI reject: [V: 04-1] bgpalerter: Pass through RPKI config [puppet] - 10https://gerrit.wikimedia.org/r/878877 (owner: 10Jbond) [11:43:08] (03PS13) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [11:43:47] (03PS2) 10Muehlenhoff: Add SPDX headers for various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/860912 (https://phabricator.wikimedia.org/T308013) [11:44:43] (03CR) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [11:44:50] (03CR) 10Ayounsi: [C: 03+1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [11:47:15] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1486.eqiad.wmnet [11:48:23] (03PS1) 10Muehlenhoff: package_builder::pbuilder_hook: Manage the hook directory with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/878879 [11:49:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast3006.wikimedia.org [11:50:05] RECOVERY - mediawiki-installation DSH group on mw1486 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:50:20] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1486.eqiad.wmnet [11:50:20] !log cgoubert@cumin1001 END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1486.eqiad.wmnet [11:50:35] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [11:50:36] (ConfdResourceFailed) resolved: (2) confd resource _srv_config-master_pybal_eqiad_aux-k8s-ingress.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:51:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [11:51:23] !log repooled mw1486 in api_appserver eqiad after hardware investigation - T326425 [11:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:26] T326425: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 [11:51:29] (03CR) 10Slyngshede: [C: 03+2] role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [11:51:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Clement_Goubert) Server repooled, thanks a bunch. 
[11:52:33] (03CR) 10Volans: "LGTM but I'll leave it to John for the review of the intricacies of recurse/purge" [puppet] - 10https://gerrit.wikimedia.org/r/878879 (owner: 10Muehlenhoff) [11:53:26] (03PS3) 10Ilias Sarantopoulos: ml-services: multi-processing changes for articlequality and drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) [11:53:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [11:54:45] (03CR) 10Ilias Sarantopoulos: ml-services: multi-processing changes for articlequality and drafttopic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [11:55:35] (03PS1) 10Effie Mouzeli: memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 [11:55:59] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: multi-processing changes for articlequality and drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [11:57:21] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:58:17] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp5018.eqsin.wmnet, cp5022.eqsin.wmnet are marked down but pooled: uploadlb_80: Servers cp5028.eqsin.wmnet, cp5030.eqsin.wmnet, cp5032.eqsin.wmnet are marked down but pooled: testlb_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb_443: Servers cp5024.eqsi [11:58:17] cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_80: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet are marked 
down but pooled: testlb6_80: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet are marked down but pooled: uploadlb6_80: Servers cp5028.eqsin.wmnet, cp5030.eqsin.wmnet, cp5032.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5026.eqsin.wmnet, cp5030.eqsin.wmnet, cp5032.eqs [11:58:17] are marked down but pooled: uploadlb6_443: Servers cp5028.eqsin.wmnet, cp5026.eqsin.wmnet, cp5030.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5022 https://wikitech.wikimedia.org/wiki/PyBal [11:59:06] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a respons [11:59:06] ceived: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:00:26] why delayed alerts, godog ? Do they have a higher timeout? [12:00:37] (03PS1) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 [12:01:08] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:01:12] or maybe it is a new issue in sin OSPF status on mr1-eqsin is CRITICAL ? 
[12:01:51] (03Merged) 10jenkins-bot: ml-services: multi-processing changes for articlequality and drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/878866 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [12:01:55] (03CR) 10CI reject: [V: 04-1] environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (owner: 10Jbond) [12:02:06] maybe it is flapping? [12:04:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878177 (https://phabricator.wikimedia.org/T313733) (owner: 10Effie Mouzeli) [12:04:20] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test p [12:04:20] ed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:05:22] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:05:54] yeah, it is flapping on and off, I will see if I can downtime it [12:06:58] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb_443: Servers 
cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but [12:06:58] testlb6_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5022.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin. [12:06:58] p5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:08:34] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [12:09:17] I will downtime LVS health checks on 4, 5 and 6 until tomorrow, CC vgutierrez in case they return earlier and have to be deleted [12:09:41] lvs500X, it is understood [12:10:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [12:10:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1003.eqiad.wmnet with OS bullseye [12:10:17] (they are up, but some backends aren't) [12:10:58] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1004.eqiad.wmnet with OS bullseye [12:11:24] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/878885 [12:11:42] hopefully that solves the flapping alerts [12:13:34] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:14:18] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[12:15:58] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/878885 (owner: 10Muehlenhoff) [12:17:16] (03CR) 10Muehlenhoff: [C: 03+2] Add SPDX headers for various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/860912 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:17:51] (03PS2) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 [12:18:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast4004.wikimedia.org [12:18:43] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:19:07] (03CR) 10CI reject: [V: 04-1] environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (owner: 10Jbond) [12:21:20] (03PS1) 10Slyngshede: C:idm::deployment Add missing package [puppet] - 10https://gerrit.wikimedia.org/r/878928 (https://phabricator.wikimedia.org/T320795) [12:21:37] (03PS3) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 [12:22:17] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment Add missing package [puppet] - 10https://gerrit.wikimedia.org/r/878928 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [12:22:54] (03CR) 10CI reject: [V: 04-1] environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (owner: 10Jbond) [12:24:20] (03PS4) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 [12:24:22] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:53] (03PS5) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [12:24:53] !log btullis@cumin1001 START - 
Cookbook sre.wikireplicas.add-wiki [12:24:55] !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) [12:25:26] (03PS2) 10Effie Mouzeli: memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 [12:25:52] (03CR) 10CI reject: [V: 04-1] memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli) [12:27:18] (03CR) 10Jbond: [C: 03+2] P:environment: Add ablilty to inject environment variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [12:27:43] (03PS6) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [12:27:45] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Updates of passwords of users created with postgresql::user / PostgreSQL change to scram-sha256 - https://phabricator.wikimedia.org/T326325 (10LSobanski) [12:28:56] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:25] (03PS3) 10Effie Mouzeli: memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 [12:29:45] (03CR) 10CI reject: [V: 04-1] memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli) [12:30:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 56630 [12:30:37] (03PS7) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [12:31:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 56630 [12:31:37] 
(03PS4) 10Effie Mouzeli: memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 [12:33:35] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4004.wikimedia.org - jmm@cumin2002" [12:34:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4004.wikimedia.org - jmm@cumin2002" [12:34:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:34:40] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast4004.wikimedia.org on all recursors [12:34:50] (03PS1) 10Btullis: Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872) [12:35:00] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) bast4004.wikimedia.org on all recursors [12:36:36] (03PS2) 10Btullis: Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872) [12:36:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8849 [12:37:06] (03PS8) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [12:37:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli) [12:38:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39074/console" [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [12:39:08] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on 
netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:40:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8849 [12:40:09] (03PS9) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [12:40:18] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [12:41:22] (03CR) 10Btullis: "Adding Amir and Manuel for sanity checking please." [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872) (owner: 10Btullis) [12:42:40] !log installing postgresql 11 security updates on maps/codfw [12:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:46] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:43:02] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 40): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39075/console" [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli) [12:43:27] (03PS3) 10JMeybohm: cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292) [12:45:36] (03PS10) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [12:46:18] (03CR) 10Effie Mouzeli: [C: 03+2] site: Remove retired mc* hosts [puppet] - 10https://gerrit.wikimedia.org/r/878177 
(https://phabricator.wikimedia.org/T313733) (owner: 10Effie Mouzeli) [12:46:24] (03PS2) 10Effie Mouzeli: site: Remove retired mc* hosts [puppet] - 10https://gerrit.wikimedia.org/r/878177 (https://phabricator.wikimedia.org/T313733) [12:49:15] (03CR) 10Ayounsi: environment: add no_proxy config directly to environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [12:51:03] jynus: not sure tbh, maybe downtime expired [12:51:14] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:53:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast4004.wikimedia.org [12:56:02] (03PS1) 10JMeybohm: coredns: Remove deprecated nodeSelector [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 [12:57:21] (03CR) 10Effie Mouzeli: [C: 03+2] memcached: minor fix for bullseye installation [puppet] - 10https://gerrit.wikimedia.org/r/878881 (owner: 10Effie Mouzeli) [12:58:27] (03CR) 10Muehlenhoff: bgpalerter: add profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878868 (owner: 10Jbond) [12:59:28] (03PS4) 10Jbond: puppet: allow to specify the exact message when disable/enable puppet [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 [12:59:52] (03CR) 10Jbond: "updated" [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [12:59:59] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [13:01:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/878879 (owner: 10Muehlenhoff) [13:01:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast6002.wikimedia.org 
[13:01:51] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:03:51] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [13:03:52] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc1038.eqiad.wmnet with OS bullseye [13:04:33] (03PS2) 10JMeybohm: coredns: Remove deprecated nodeSelector [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 [13:06:27] (03CR) 10Jbond: [C: 03+2] puppet: allow to specify the exact message when disable/enable puppet [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [13:06:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo Legal's take on what words we should be looking for" [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [13:07:29] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1038.eqiad.wmnet with OS bullseye [13:07:46] (03CR) 10Jbond: [C: 03+2] O:rpkivalidator: add bgpalerter to rpki servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:09:18] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:25] (03Merged) 10jenkins-bot: puppet: allow to specify the exact message when disable/enable puppet [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [13:11:11] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast6002.wikimedia.org - jmm@cumin2002" [13:11:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast6002.wikimedia.org - jmm@cumin2002" [13:11:53] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:11:54] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast6002.wikimedia.org on all recursors [13:12:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) bast6002.wikimedia.org on all recursors [13:12:16] (03PS1) 10Jbond: P:bgpalerter: make sure we create the sysuser before calling the class [puppet] - 10https://gerrit.wikimedia.org/r/878937 [13:12:57] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Skip empty fixture files [deployment-charts] - 10https://gerrit.wikimedia.org/r/875947 (owner: 10JMeybohm) [13:13:32] PROBLEM - Check systemd state on rpki1001 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:58] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:02] (03PS2) 10Jbond: P:bgpalerter: make sure we create the sysuser before calling the class [puppet] - 10https://gerrit.wikimedia.org/r/878937 [13:15:11] node-bgpalerter.service is expected, I'll fix [13:15:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878937 (owner: 10Jbond) [13:15:26] (03CR) 10Jbond: [C: 03+2] P:bgpalerter: make sure we create the sysuser before calling the class [puppet] - 10https://gerrit.wikimedia.org/r/878937 (owner: 10Jbond) [13:18:32] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1038.eqiad.wmnet with reason: host reimage [13:20:10] PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1038.eqiad.wmnet with reason: host 
reimage [13:27:53] (03CR) 10Gmodena: Add flink-app-example service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [13:29:18] (03PS1) 10Effie Mouzeli: memcached: minor fix for bullseye installation #2 [puppet] - 10https://gerrit.wikimedia.org/r/878939 [13:31:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast6002.wikimedia.org [13:34:13] (03CR) 10Ottomata: Add flink-app-example service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [13:35:32] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [13:38:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878939 (owner: 10Effie Mouzeli) [13:38:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [13:42:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35753 [13:44:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35753 [13:44:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9584 [13:45:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9584 [13:45:47] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3302 [13:45:55] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade [13:46:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3302 [13:47:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37002 [13:47:25] !log ayounsi@cumin1001 END (PASS) - Cookbook 
sre.network.peering (exit_code=0) with action 'email' for AS: 37002 [13:47:44] (03PS4) 10JMeybohm: cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292) [13:47:46] (03PS3) 10JMeybohm: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729) [13:47:48] (03PS1) 10JMeybohm: Pin coredns, eventrouter and helm-state-metrics for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878940 (https://phabricator.wikimedia.org/T307943) [13:47:50] (03PS1) 10JMeybohm: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729) [13:50:37] (03Abandoned) 10Ayounsi: Adding more new LEAF switches from Eqiad rows E/F to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:50:58] (03CR) 10Effie Mouzeli: "PCC NOOP https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39077" [puppet] - 10https://gerrit.wikimedia.org/r/878939 (owner: 10Effie Mouzeli) [13:51:59] (03PS4) 10JMeybohm: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729) [13:52:01] (03PS2) 10JMeybohm: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729) [13:52:03] (03PS1) 10JMeybohm: calico: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729) [13:52:56] (03CR) 10Effie Mouzeli: [C: 03+2] memcached: minor fix for bullseye installation #2 [puppet] - 10https://gerrit.wikimedia.org/r/878939 
(owner: 10Effie Mouzeli) [13:53:08] (03PS2) 10Effie Mouzeli: memcached: minor fix for bullseye installation #2 [puppet] - 10https://gerrit.wikimedia.org/r/878939 [13:54:09] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:25] (03CR) 10Ayounsi: "That's more than 2 years old, is it still needed?" [puppet] - 10https://gerrit.wikimedia.org/r/618767 (owner: 10Volans) [13:55:03] (03PS1) 10Muehlenhoff: Add bast3006/bast4004/bast6002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/878945 (https://phabricator.wikimedia.org/T324974) [13:55:04] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [13:55:20] (03Abandoned) 10Jbond: hieradata: add ASN name comments [puppet] - 10https://gerrit.wikimedia.org/r/753147 (owner: 10Jbond) [13:55:43] (03PS1) 10JMeybohm: staging-codfw: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729) [13:58:11] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1400). [14:00:05] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:32] o/ [14:01:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1001.eqiad.wmnet [14:02:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [14:02:22] * MichaelG_WMDE is here to [14:02:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1004.eqiad.wmnet with OS bullseye [14:02:27] *too [14:03:37] bleh [14:03:38] `scap backport 877983 877972` [14:03:52] “backport failed: Request Failed: https://gerrit.wikimedia.org/r/changes/Icfe7f38fdf9c3255d51713d3084593f880425d06/revisions/current/crd 404 Multiple changes found for Icfe7f38fdf9c3255d51713d3084593f880425d06” [14:04:04] if only I had specified change numbers, which are unique, instead of ambiguous change IDs……… [14:04:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/WikibaseLexeme] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877983 (https://phabricator.wikimedia.org/T326621) (owner: 10Lucas Werkmeister (WMDE)) [14:04:36] I think that's an (already reported) scap bug with dependencies across multiple branches [14:04:44] looks like doing them one at a time will work, it’ll just take even longer in CI 🤷 [14:05:13] or you can +2 them manually [14:05:23] yeah I’ll do that in a few minutes [14:05:42] looks like https://phabricator.wikimedia.org/T323277 is the phab task [14:06:33] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bullseye [14:08:04] (03CR) 10Muehlenhoff: [C: 03+2] Add bast3006/bast4004/bast6002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/878945 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff) [14:09:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "already +2ing to speed up backport later" [extensions/Wikibase] 
(wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE)) [14:10:22] !log installing postgresql 11 security updates on maps/eqiad [14:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1001.eqiad.wmnet [14:12:25] PROBLEM - puppet last run on puppetdb2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604986 seconds, message: maint, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:58] puppetdb2002 was me, fixing [14:14:39] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) [14:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:16:03] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:52] * MichaelG_WMDE is afk, but back quickly [14:18:01] RECOVERY - puppet last run on puppetdb2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:19:12] (03Merged) 10jenkins-bot: Fix test constructing HTMLFormField without parent [extensions/WikibaseLexeme] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877983 (https://phabricator.wikimedia.org/T326621) (owner: 10Lucas Werkmeister (WMDE)) [14:19:41] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:877983|Fix test constructing HTMLFormField without parent (T326621)]] [14:19:45] T326621: 
Wikibase\Lexeme\Tests\MediaWiki\Specials\HTMLForm\LemmaLanguageFieldTest::testValidateWithValidLanguageCodeReturnsTrue HTMLFormField::__construct: Constructing an HTMLFormField without a 'parent' parameter - https://phabricator.wikimedia.org/T326621 [14:21:28] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and lucaswerkmeister-wmde: Backport for [[gerrit:877983|Fix test constructing HTMLFormField without parent (T326621)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:22:03] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [14:22:11] * MichaelG_WMDE is back and looking at zuul [14:23:51] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:24:03] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [14:25:22] (03Merged) 10jenkins-bot: Add missing parentheses to vector search match text [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE)) [14:25:29] meh, it merged too soon [14:25:33] I’ll have to sync that one manually then [14:27:13] scap backport works just fine with already merged commits [14:27:18] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:20] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:877983|Fix test constructing HTMLFormField without 
parent (T326621)]] (duration: 08m 38s) [14:28:24] T326621: Wikibase\Lexeme\Tests\MediaWiki\Specials\HTMLForm\LemmaLanguageFieldTest::testValidateWithValidLanguageCodeReturnsTrue HTMLFormField::__construct: Constructing an HTMLFormField without a 'parent' parameter - https://phabricator.wikimedia.org/T326621 [14:28:55] taavi: I get the “multiple changes found” error again so I guess it’s still confused by the Depends-On [14:29:03] ah, hmm [14:29:06] (from just `scap backport 877972`) [14:29:53] pulled the Wikibase change to mwdebug1001 [14:30:03] should be testable on test wikidata [14:30:04] (03PS1) 10Jelto: sre.gitlab.upgrade: use url instead of hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/878953 (https://phabricator.wikimedia.org/T323569) [14:30:06] (cc MichaelG_WMDE) [14:30:18] * MichaelG_WMDE looks [14:30:37] (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 (owner: 10Giuseppe Lavagetto) [14:30:43] looks good to me, I think [14:30:49] works for me! 
[14:31:43] I'm fine with this moving forward :) [14:32:18] syncing [14:32:27] (03CR) 10Volans: [C: 03+1] "makes sense" [cookbooks] - 10https://gerrit.wikimedia.org/r/878953 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:32:45] (with T326633 in the log message because I think it’s better than no task at all) [14:32:46] T326633: Monitor the deployment of the new Search on the 2022 version of the Vector skin - https://phabricator.wikimedia.org/T326633 [14:35:57] yep, I think that is what that task is for, all the misc stuff [14:39:10] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:22] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.18/extensions/Wikibase/repo/resources/wikibase.vector.searchClient.js: Backport: [[gerrit:877972|Add missing parentheses to vector search match text (T326633)]] (1/2) (duration: 07m 09s) [14:39:25] T326633: Monitor the deployment of the new Search on the 2022 version of the Vector skin - https://phabricator.wikimedia.org/T326633 [14:39:58] and syncing the second file now [14:40:05] (just for consistency) [14:40:42] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: use url instead of hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/878953 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:42:02] I can confirm that it now also works on test.wikidata without WikimediaDebug [14:42:12] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [14:42:48] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: use url instead of hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/878953 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:42:54] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:55] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (10StephaneRebai) Thank you @Jelto i will verify access and close this when done [14:44:38] (03PS1) 10Hnowlan: thumbor: set maxSurge [deployment-charts] - 10https://gerrit.wikimedia.org/r/878957 (https://phabricator.wikimedia.org/T233196) [14:44:49] yay [14:46:09] (03CR) 10Eevans: [C: 03+2] cassandra_dev: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878186 (owner: 10Eevans) [14:46:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.18/extensions/Wikibase/repo/tests/jest/wikibase.vector.searchClient.spec.js: Backport: [[gerrit:877972|Add missing parentheses to vector search match text (T326633)]] (2/2) (duration: 06m 46s) [14:46:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [14:46:41] T326633: Monitor the deployment of the new Search on the 2022 version of the Vector skin - https://phabricator.wikimedia.org/T326633 [14:46:46] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1005.eqiad.wmnet with OS bullseye [14:46:56] I don’t see anything else in the deployment calendar [14:47:01] !log UTC afternoon backport+config window done [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:50] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:49:01] (03CR) 10Eevans: [V: 03+2 C: 03+2] cassandra_dev: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878186 (owner: 10Eevans) [14:54:10] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:39] (03PS2) 10Eevans: Upgraded deployment-prep echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860941 [14:56:59] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:58:24] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10jbond) > With bullseye apt even does this automatically wonder if we could backport this to buster, ignore stretch and call it done? [14:58:54] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:37] (03CR) 10Krinkle: [C: 04-1] Start using the ClusterConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 (owner: 10Giuseppe Lavagetto) [15:04:29] (03PS1) 10Andrew Bogott: Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960 [15:04:54] (03CR) 10CI reject: [V: 04-1] Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960 (owner: 10Andrew Bogott) [15:06:44] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10MoritzMuehlenhoff) >>! In T277011#8516622, @jbond wrote: >> With bullseye apt even does this automatically > wonder if we could backport this to buster, ignore... 
[15:09:23] (03PS11) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [15:10:07] (03PS2) 10Muehlenhoff: package_builder::pbuilder_hook: Manage the hook directory with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/878879 [15:10:38] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:00] (03PS12) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [15:12:03] (03PS1) 10Effie Mouzeli: P:memcached::memkeys: do not install memkeys if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) [15:12:45] (03CR) 10CI reject: [V: 04-1] P:memcached::memkeys: do not install memkeys if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli) [15:13:35] (03PS2) 10Effie Mouzeli: P:memcached::memkeys: do not install memkeys if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) [15:14:21] (03PS2) 10Andrew Bogott: Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960 [15:14:44] (03CR) 10CI reject: [V: 04-1] Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960 (owner: 10Andrew Bogott) [15:14:59] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10jbond) >>! In T277011#8516648, @MoritzMuehlenhoff wrote: >>>! In T277011#8516622, @jbond wrote: >>> With bullseye apt even does this automatically >> wonder if... 
[15:15:46] (03CR) 10Muehlenhoff: P:memcached::memkeys: do not install memkeys if on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli) [15:17:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P42982 and previous config saved to /var/cache/conftool/dbconfig/20230111-151712-marostegui.json [15:17:31] (03PS3) 10Effie Mouzeli: P:memcached::memkeys: do not install memkeys if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) [15:18:02] (03PS1) 10Marostegui: db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878964 [15:18:47] (03CR) 10Marostegui: [C: 03+2] db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878964 (owner: 10Marostegui) [15:21:24] !log Stop mariadb on db1106 to reclone db1206 (there will be lag on s1 on wikireplicas) T326669 [15:21:25] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/878962/39081/" [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli) [15:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:28] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [15:23:42] (03CR) 10Marostegui: [C: 03+1] Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872) (owner: 10Btullis) [15:27:02] (03PS1) 10Jbond: bgpalerter: add default monitors and reports [puppet] - 10https://gerrit.wikimedia.org/r/879046 [15:30:29] (03PS2) 10Muehlenhoff: postgresql::user: No longer compare password hashes on Bookworm and later [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) [15:31:37] !log pt1979@cumin2002 END 
(FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:32:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39083/console" [puppet] - 10https://gerrit.wikimedia.org/r/879046 (owner: 10Jbond) [15:32:27] (03CR) 10Jbond: [V: 03+1 C: 03+2] bgpalerter: add default monitors and reports [puppet] - 10https://gerrit.wikimedia.org/r/879046 (owner: 10Jbond) [15:32:37] jouncebot, nowandnext [15:32:37] No deployments scheduled for the next 2 hour(s) and 27 minute(s) [15:32:37] In 2 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1800) [15:33:30] (03CR) 10Zabe: [C: 03+2] Start reading from cul_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878870 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [15:34:25] (03Merged) 10jenkins-bot: Start reading from cul_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878870 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [15:34:52] !log zabe@deploy1002 Started scap: Backport for [[gerrit:878870|Start reading from cul_actor everywhere (T233004)]] [15:34:56] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [15:36:34] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:878870|Start reading from cul_actor everywhere (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [15:37:49] (ProbeDown) firing: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:49] (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:57] (03PS1) 10Jbond: bgpalerter: use wss/https for websocket connection [puppet] - 10https://gerrit.wikimedia.org/r/879048 [15:38:17] zabe: something just went down [15:38:30] looking, got paged [15:38:34] (virtual-chassis crash) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [15:38:40] ncredir? this is bigger? [15:38:44] weird, we just got pages from eqsin [15:38:46] (ThanosSidecarBucketOperationsFailed) firing: Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [15:38:52] !log zabe@deploy1002 sync-world aborted: Backport for [[gerrit:878870|Start reading from cul_actor everywhere (T233004)]] (duration: 04m 00s) [15:38:53] !log zabe@deploy1002 backport aborted: (duration: 04m 25s) [15:38:56] godog: dunno if related, but the faulty switch in eqsin just came back up [15:39:03] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:04] godog: fyi, deployment in progress [15:39:07] XioNoX: that would explain it [15:39:16] ah, eqsin, ok [15:39:18] but let's double check [15:39:23] could be yeah, I'll silence and ack the alerts [15:39:24] no user affected [15:39:32] godog: but it shouldn't alert :) [15:39:36] as it's coming back up [15:39:50] XioNoX: I think it could happen because there were some unknowns [15:40:12] that become knowns, and then could wake up in bad order, but checking it is not something else [15:41:09] (03CR) 10Jbond: [C: 03+2] bgpalerter: use wss/https for websocket 
connection [puppet] - 10https://gerrit.wikimedia.org/r/879048 (owner: 10Jbond) [15:41:19] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) > Next step is onsite "format install" https://supportportal.juniper.net/s/article/EX-QFX-Procedure-to-format-install-QFX5K-device-using-a-USB?lang... [15:41:23] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:41:27] mmhh yeah not sure yet why ProbeDown notified tbh [15:41:38] but it recovered alright [15:41:59] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:42:03] I will remove the downtimes I added to make sure all services come back online [15:42:24] as in, "health checks happen correctly" [15:42:34] (Emergency syslog message) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:42:46] (JobUnavailable) firing: (24) Reduced availability for job bird in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:49] (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:42:49] (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:43:30] (03PS4) 10Effie Mouzeli: 
P:memcached::memkeys: install memkeys only if on buster [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) [15:43:43] (03PS1) 10Marostegui: db1206: Future sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/879049 [15:43:46] (ThanosSidecarBucketOperationsFailed) resolved: Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [15:44:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: 10Muehlenhoff) [15:44:45] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:45:00] (03CR) 10Eevans: [C: 03+2] Upgraded deployment-prep echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860941 (owner: 10Eevans) [15:45:09] (03CR) 10Marostegui: [C: 03+2] db1206: Future sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/879049 (owner: 10Marostegui) [15:45:16] !log zabe@deploy1002 Started scap: T233004 [15:45:19] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [15:45:46] (03Merged) 10jenkins-bot: Upgraded deployment-prep echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860941 (owner: 10Eevans) [15:46:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli) [15:47:34] (Emergency syslog message) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:48:15] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the 
systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:48:34] (virtual-chassis crash) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [15:48:48] (03CR) 10Effie Mouzeli: P:memcached::memkeys: install memkeys only if on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli) [15:50:00] (03PS3) 10Andrew Bogott: Trove: Patch a trove/dns bug. [puppet] - 10https://gerrit.wikimedia.org/r/878960 [15:50:27] (03PS1) 10Ottomata: flink - Add examples/wikimedia with simple table datagen -> print pipeline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/879050 (https://phabricator.wikimedia.org/T316519) [15:50:31] (03CR) 10Effie Mouzeli: [C: 03+2] P:memcached::memkeys: install memkeys only if on buster [puppet] - 10https://gerrit.wikimedia.org/r/878962 (https://phabricator.wikimedia.org/T228970) (owner: 10Effie Mouzeli) [15:51:35] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink - Add examples/wikimedia with simple table datagen -> print pipeline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/879050 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:53:08] (03CR) 10Andrew Bogott: [C: 03+2] Trove: Patch a trove/dns bug. 
[puppet] - 10https://gerrit.wikimedia.org/r/878960 (owner: 10Andrew Bogott) [15:53:10] !log zabe@deploy1002 Finished scap: T233004 (duration: 07m 54s) [15:53:14] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [15:54:31] (03PS3) 10Muehlenhoff: postgresql::user: No longer compare password hashes on Bookworm and later [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) [15:54:37] (03PS1) 10Ayounsi: Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) [15:55:33] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:56:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: 10Muehlenhoff) [15:56:27] (03CR) 10CI reject: [V: 04-1] Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [15:56:51] (03CR) 10Dzahn: [C: 03+2] scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 (owner: 10Dzahn) [15:57:58] (03CR) 10Clément Goubert: [C: 03+1] "LGTM, we should probably add that (or a similar mechanism) by defaults in the service scaffolding." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/878957 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:58:35] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:58:52] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:00:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:21] (03CR) 10Papaul: [C: 03+2] admin: Add Jennifer Hancock to the datacenter-ops group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [16:00:39] (03PS2) 10Ottomata: Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) [16:00:55] (03PS1) 10Andrew Bogott: Replace 'yoga' with 'zed' in a zed manifest [puppet] - 10https://gerrit.wikimedia.org/r/879052 [16:01:50] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host mc1038.eqiad.wmnet with OS bullseye [16:01:56] (03CR) 10Muehlenhoff: postgresql::user: No longer compare password hashes on Bookworm and later (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: 10Muehlenhoff) [16:02:04] (03PS3) 10Ottomata: Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) [16:02:23] (03PS1) 10Jdrewniak: Fix mustache template rendering when TOC is rerendered after an edit [skins/Vector] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879093 (https://phabricator.wikimedia.org/T326682) [16:02:44] (03PS1) 10Jdrewniak: Fix mustache template rendering when TOC is rerendered after an edit [skins/Vector] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879094 (https://phabricator.wikimedia.org/T326682) [16:03:14] !log volans@cumin1001 START - Cookbook sre.dns.netbox 
[16:04:20] (03PS2) 10Ayounsi: Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) [16:04:57] (03CR) 10CI reject: [V: 04-1] Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:05:06] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Force update after eqsin outage is over - volans@cumin1001" [16:05:35] (03CR) 10Andrew Bogott: [C: 03+2] Replace 'yoga' with 'zed' in a zed manifest [puppet] - 10https://gerrit.wikimedia.org/r/879052 (owner: 10Andrew Bogott) [16:05:53] (03PS1) 10Marostegui: add_cul_reason_id_T321391.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/879054 [16:06:10] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Force update after eqsin outage is over - volans@cumin1001" [16:06:10] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:06:51] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:07:55] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:10:30] (03CR) 10Michael Große: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [16:10:49] (03CR) 10Ottomata: Add flink-app-example service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:12:56] (03PS5) 10JMeybohm: cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292) [16:12:58] (03PS2) 10JMeybohm: Pin coredns, eventrouter and helm-state-metrics for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878940 (https://phabricator.wikimedia.org/T307943) [16:13:00] (03PS5) 10JMeybohm: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729) [16:13:02] (03PS3) 10JMeybohm: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729) [16:13:04] (03PS2) 10JMeybohm: calico: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729) [16:13:06] (03PS2) 10JMeybohm: staging-codfw: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729) [16:13:24] (03CR) 10JMeybohm: [C: 03+2] Skip empty fixture files [deployment-charts] - 10https://gerrit.wikimedia.org/r/875947 (owner: 10JMeybohm) [16:15:45] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:07] (03CR) 10Lucas Werkmeister (WMDE): Enable the API on test-wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: 
10Michael Große) [16:16:31] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:01] (03PS2) 10Michael Große: Enable the REST API on test-wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) [16:17:35] (03CR) 10Michael Große: Enable the REST API on test-wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [16:18:38] (03Merged) 10jenkins-bot: Skip empty fixture files [deployment-charts] - 10https://gerrit.wikimedia.org/r/875947 (owner: 10JMeybohm) [16:18:39] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:19:21] (03PS3) 10Ayounsi: Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) [16:20:01] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:21:30] (03CR) 10JMeybohm: Add flink-app-example service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:22:00] !log dbmaint deploy schema change with replication on s6 eqiad T321391 [16:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:03] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [16:22:35] (03CR) 10Ayounsi: "Some outstanding questions:" [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [16:23:13] (03PS1) 10Zabe: Start reading from cuc_actor on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879055 (https://phabricator.wikimedia.org/T233004) [16:24:03] (03CR) 10Ayounsi: Netbox: add support for central Redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [16:25:36] (03CR) 10Ayounsi: Netbox: add support for central Redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [16:25:52] !log dbmaint deploy schema change with replication on s8 eqiad T321391 [16:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:12] (03PS1) 10Zabe: Stop writing to cul_user and cul_user_text on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004) [16:28:14] (03PS4) 10Ottomata: Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) [16:28:33] (03CR) 
10Ottomata: Add flink-app-example service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:30:47] (03CR) 10CI reject: [V: 04-1] Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:31:23] !log dbmaint deploy schema change with replication on s4 eqiad T321391 [16:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:27] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [16:33:11] (03PS5) 10Ottomata: Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) [16:35:15] (03CR) 10Ladsgroup: [C: 03+1] add_cul_reason_id_T321391.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/879054 (owner: 10Marostegui) [16:35:22] (03CR) 10Marostegui: [C: 03+2] add_cul_reason_id_T321391.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/879054 (owner: 10Marostegui) [16:35:49] (03Merged) 10jenkins-bot: add_cul_reason_id_T321391.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/879054 (owner: 10Marostegui) [16:35:59] (03CR) 10CI reject: [V: 04-1] Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:37:34] (Processor usage over 85%) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [16:38:49] !log dbmaint deploy schema change with replication on s5 eqiad T321391 [16:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:52] T321391: Add new column 
cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [16:40:05] (03PS6) 10Ottomata: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) [16:41:37] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [16:41:55] (03PS1) 10Jelto: sre.gitlab.upgrade: use spicerack reason instead of string [cookbooks] - 10https://gerrit.wikimedia.org/r/879059 (https://phabricator.wikimedia.org/T323569) [16:41:57] (03PS1) 10Jelto: sre.gitlab.upgrade: check for high threshold in fail_for_disk_space [cookbooks] - 10https://gerrit.wikimedia.org/r/879060 (https://phabricator.wikimedia.org/T323569) [16:42:14] (03CR) 10JMeybohm: [C: 03+2] Pin coredns, eventrouter and helm-state-metrics for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878940 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:42:28] (03CR) 10JMeybohm: [C: 03+2] Add istio config for main/wikikube clusters on k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878190 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [16:42:36] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/879059 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:43:15] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/879060 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:44:24] (03PS7) 10Ottomata: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) [16:45:52] (03PS8) 10Ottomata: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - 
10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) [16:46:01] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: check for high threshold in fail_for_disk_space [cookbooks] - 10https://gerrit.wikimedia.org/r/879060 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:46:03] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: use spicerack reason instead of string [cookbooks] - 10https://gerrit.wikimedia.org/r/879059 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:46:30] (03CR) 10Effie Mouzeli: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [16:47:09] (03Merged) 10jenkins-bot: Add istio config for main/wikikube clusters on k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878190 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [16:47:17] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:50] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: use spicerack reason instead of string [cookbooks] - 10https://gerrit.wikimedia.org/r/879059 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:47:57] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: check for high threshold in fail_for_disk_space [cookbooks] - 10https://gerrit.wikimedia.org/r/879060 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:51:43] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:52:34] (Processor usage over 85%) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Processor usage over 85% - 
https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [16:53:17] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:53:54] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops, 10Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10ayounsi) 05Stalled→03Resolved That's all done. [16:54:01] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) [16:54:19] (03PS1) 10Marostegui: Revert "db1106: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/879095 [16:54:23] (03CR) 10Marostegui: [C: 04-2] "Not ready yet" [puppet] - 10https://gerrit.wikimedia.org/r/879095 (owner: 10Marostegui) [16:54:27] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) [16:55:09] (03CR) 10Btullis: [C: 03+1] "This looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:56:45] (03CR) 10BCornwall: [V: 03+1] varnish: Template out thread pool settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [16:57:07] (03Merged) 10jenkins-bot: cert-manager: Allow leader election to write configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/878751 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [16:57:10] (03Merged) 10jenkins-bot: Pin coredns, eventrouter and helm-state-metrics for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878940 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:58:47] (03PS6) 10JMeybohm: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729) [16:58:49] (03PS4) 10JMeybohm: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729) [16:58:51] (03PS3) 10JMeybohm: calico: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729) [16:58:53] (03PS3) 10JMeybohm: staging-codfw: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729) [16:59:58] (03CR) 10Btullis: [C: 03+2] Detect the correct disks for the O/S on the cephosd servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [17:00:15] (03CR) 10Btullis: [C: 03+2] Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872) (owner: 10Btullis) [17:00:43] 
(03PS3) 10Btullis: Allow the wikireplicas.add-wiki cookbook to replace existing views [cookbooks] - 10https://gerrit.wikimedia.org/r/878930 (https://phabricator.wikimedia.org/T310872) [17:03:24] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:03:32] (03PS5) 10BCornwall: varnish: Template out thread pool settings [puppet] - 10https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) [17:03:46] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:04:10] !log dbmaint deploy schema change with replication on s7 eqiad T321391 [17:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:13] (03PS9) 10Ottomata: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) [17:04:13] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [17:06:00] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39090/console" [puppet] - 10https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [17:08:22] (03CR) 10Marostegui: Revert "db1106: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/879095 (owner: 10Marostegui) [17:08:22] 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Jelto) [17:08:27] (03CR) 10Marostegui: [C: 03+2] Revert "db1106: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/879095 (owner: 10Marostegui) [17:09:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [17:10:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 
db2097.codfw.wmnet with reason: Maintenance [17:10:12] (03CR) 10JMeybohm: [C: 03+2] staging-codfw: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:10:15] (03CR) 10JMeybohm: [C: 03+2] calico: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:10:17] (03CR) 10JMeybohm: [C: 03+2] Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:10:21] (03CR) 10JMeybohm: [C: 03+2] coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:10:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 1%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42987 and previous config saved to /var/cache/conftool/dbconfig/20230111-171021-root.json [17:10:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance [17:10:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance [17:10:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance [17:11:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance [17:11:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2112 (T321391)', diff saved to https://phabricator.wikimedia.org/P42988 and previous config saved to /var/cache/conftool/dbconfig/20230111-171114-marostegui.json [17:11:18] T321391: Add new column 
cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [17:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T321391)', diff saved to https://phabricator.wikimedia.org/P42989 and previous config saved to /var/cache/conftool/dbconfig/20230111-171338-marostegui.json [17:14:59] (03CR) 10JMeybohm: [C: 03+1] Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:15:40] (03Merged) 10jenkins-bot: coredns: Remove deprecated nodeSelector, kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878935 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:15:42] (03Merged) 10jenkins-bot: Remove kubernetesApi hack from helm-state-metrics and eventrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/878941 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:15:44] (03Merged) 10jenkins-bot: calico: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878943 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:15:46] (03Merged) 10jenkins-bot: staging-codfw: Remove kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/878946 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:17:19] (03CR) 10Hnowlan: [C: 03+2] thumbor: set maxSurge [deployment-charts] - 10https://gerrit.wikimedia.org/r/878957 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [17:18:29] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:18:34] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:20:37] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:21:02] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[17:21:13] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:21:29] (03PS5) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [17:21:32] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:23:32] (03Merged) 10jenkins-bot: thumbor: set maxSurge [deployment-charts] - 10https://gerrit.wikimedia.org/r/878957 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [17:23:53] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39091/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:25:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 5%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42991 and previous config saved to /var/cache/conftool/dbconfig/20230111-172526-root.json [17:28:24] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:28:35] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[17:28:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P42992 and previous config saved to /var/cache/conftool/dbconfig/20230111-172844-marostegui.json [17:29:16] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [17:30:54] (03PS1) 10JMeybohm: staging-codfw: Unpin eventrouter, helm-state-metrics, coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/879063 (https://phabricator.wikimedia.org/T326340) [17:31:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: 10Muehlenhoff) [17:36:55] (03CR) 10JMeybohm: [C: 03+2] staging-codfw: Unpin eventrouter, helm-state-metrics, coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/879063 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [17:37:01] (03PS1) 10BCornwall: Remove all legacy_vip entries [puppet] - 10https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) [17:39:48] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [17:40:00] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [17:40:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 10%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42993 and previous config saved to /var/cache/conftool/dbconfig/20230111-174031-root.json [17:42:15] (03Merged) 10jenkins-bot: staging-codfw: Unpin eventrouter, helm-state-metrics, coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/879063 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [17:42:17] (03CR) 10Dzahn: [C: 03+2] statistics: assert new data type for globally reserved UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877276 (owner: 10Dzahn) [17:42:23] (03PS3) 10Dzahn: statistics: assert new data type for globally reserved UIDs 
[puppet] - 10https://gerrit.wikimedia.org/r/877276 [17:42:39] (03CR) 10Dzahn: statistics: assert new data type for globally reserved UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877276 (owner: 10Dzahn) [17:42:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:43:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P42994 and previous config saved to /var/cache/conftool/dbconfig/20230111-174351-marostegui.json [17:47:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:53:24] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [17:53:24] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [17:54:09] (03PS6) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [17:55:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 25%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42995 and previous config saved to /var/cache/conftool/dbconfig/20230111-175536-root.json [17:55:57] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-worker[1080,1084].eqiad.wmnet with reason: Shutting down to enable RAID battery replacement [17:56:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 
an-worker[1080,1084].eqiad.wmnet with reason: Shutting down to enable RAID battery replacement [17:56:16] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=af7b1865-a9a0-44ba-aa68-9f34812e0d62) set by btullis@cumin1001 for 7 days, 0:... [17:56:25] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39092/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:57:07] 10SRE, 10Traffic: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (10BCornwall) [17:57:22] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (10BTullis) Thanks @jcrespo - I've added another 7 days downtime. @Jclark-ctr any idea when you might be able to fit in this battery replacement... 
[17:57:27] 10SRE, 10Traffic: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (10BCornwall) p:05Triage→03Low [17:57:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:58:04] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [17:58:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T321391)', diff saved to https://phabricator.wikimedia.org/P42996 and previous config saved to /var/cache/conftool/dbconfig/20230111-175857-marostegui.json [17:59:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [17:59:01] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [17:59:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [17:59:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T321391)', diff saved to https://phabricator.wikimedia.org/P42997 and previous config saved to /var/cache/conftool/dbconfig/20230111-175919-marostegui.json [17:59:41] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1800) [18:01:14] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [18:01:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321391)', diff saved to https://phabricator.wikimedia.org/P42998 and previous config saved to 
/var/cache/conftool/dbconfig/20230111-180142-marostegui.json [18:02:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:02:55] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:04:33] 10SRE, 10Traffic: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (10BCornwall) [18:05:21] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39093/console" [puppet] - 10https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: 10BCornwall) [18:06:04] (03PS1) 10BBlack: Revert "depool eqsin for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/879111 [18:07:51] (03PS1) 10JMeybohm: admin_ng: Don't pin image version of coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/879112 (https://phabricator.wikimedia.org/T326340) [18:07:57] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:08:10] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:08:17] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39094/console" [puppet] - 10https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: 10BCornwall) [18:08:59] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[18:09:26] (03CR) 10Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [18:09:31] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:09:46] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [18:10:20] (03CR) 10BBlack: [C: 03+1] "Thanks for chasing all this down, nice result!" [puppet] - 10https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: 10BCornwall) [18:10:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 50%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42999 and previous config saved to /var/cache/conftool/dbconfig/20230111-181041-root.json [18:10:42] (03CR) 10Ottomata: [C: 03+2] Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:12:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:13:42] (03CR) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [18:14:48] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: 
executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:15:47] (03Merged) 10jenkins-bot: Add flink-app-example service in the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:16:03] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39095/console" [puppet] - 10https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: 10BCornwall) [18:16:17] (03PS1) 10Majavah: P:openstack::galera: add missing @resolve [puppet] - 10https://gerrit.wikimedia.org/r/879115 [18:16:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P43000 and previous config saved to /var/cache/conftool/dbconfig/20230111-181648-marostegui.json [18:17:26] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Remove all legacy_vip entries [puppet] - 10https://gerrit.wikimedia.org/r/879107 (https://phabricator.wikimedia.org/T239993) (owner: 10BCornwall) [18:20:12] (03CR) 10Ssingh: "PCC looks good! 
See inline comments once before we merge this" [puppet] - 10https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [18:20:39] (03PS1) 10Ottomata: flink-app-example - use correct patch to kubeconfig file in stream-enricnment-poc [deployment-charts] - 10https://gerrit.wikimedia.org/r/879116 (https://phabricator.wikimedia.org/T324576) [18:21:25] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::galera: add missing @resolve [puppet] - 10https://gerrit.wikimedia.org/r/879115 (owner: 10Majavah) [18:21:47] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Don't pin image version of coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/879112 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [18:22:19] (03CR) 10Dzahn: [C: 03+2] "noop on stat1004, an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/877276 (owner: 10Dzahn) [18:22:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:22:49] !log btullis@cumin1001 Added views for new wiki: blkwiki T310872 [18:22:50] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [18:22:53] T310872: Prepare and check storage layer for blkwiki - https://phabricator.wikimedia.org/T310872 [18:24:37] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (Holiday Leftovers 🥡), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10thcipriani) [18:25:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 75%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P43001 and previous config saved to /var/cache/conftool/dbconfig/20230111-182546-root.json [18:25:53] (03Abandoned) 10BBlack: lvs recdns: remove legacy IP 
definition, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/556178 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [18:25:59] (03Abandoned) 10BBlack: lvs recdns: remove legacy IP definition, step 2 [puppet] - 10https://gerrit.wikimedia.org/r/556179 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [18:26:40] (03Merged) 10jenkins-bot: admin_ng: Don't pin image version of coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/879112 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [18:27:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:27:59] (03CR) 10BBlack: [C: 03+2] Revert "depool eqsin for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/879111 (owner: 10BBlack) [18:28:13] !log repool eqsin edge DC [18:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:38] (03CR) 10Ottomata: [C: 03+2] flink-app-example - use correct patch to kubeconfig file in stream-enricnment-poc [deployment-charts] - 10https://gerrit.wikimedia.org/r/879116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:30:04] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:30:55] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:31:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P43002 and previous config saved to /var/cache/conftool/dbconfig/20230111-183155-marostegui.json [18:32:39] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:33:29] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[18:33:54] PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:54] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@5a19b9d]: drop-snapshots: Accept snapshot= partition from any level [18:35:25] (03Merged) 10jenkins-bot: flink-app-example - use correct patch to kubeconfig file in stream-enricnment-poc [deployment-charts] - 10https://gerrit.wikimedia.org/r/879116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:36:27] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@5a19b9d]: drop-snapshots: Accept snapshot= partition from any level (duration: 02m 33s) [18:37:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:40:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 100%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P43003 and previous config saved to /var/cache/conftool/dbconfig/20230111-184051-root.json [18:42:18] (03PS1) 10Jbond: bgpalerter: 1/2 a day to track down a missing 's' :@ [puppet] - 10https://gerrit.wikimedia.org/r/879119 [18:42:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:43:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39096/console" [puppet] - 10https://gerrit.wikimedia.org/r/879119 (owner: 10Jbond) [18:45:04] (03CR) 10Jbond: 
[V: 03+1 C: 03+2] bgpalerter: 1/2 a day to track down a missing 's' :@ [puppet] - 10https://gerrit.wikimedia.org/r/879119 (owner: 10Jbond) [18:47:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321391)', diff saved to https://phabricator.wikimedia.org/P43004 and previous config saved to /var/cache/conftool/dbconfig/20230111-184701-marostegui.json [18:47:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance [18:47:06] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [18:47:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance [18:47:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T321391)', diff saved to https://phabricator.wikimedia.org/P43005 and previous config saved to /var/cache/conftool/dbconfig/20230111-184723-marostegui.json [18:47:53] !log dbmaint deploy schema change with replication on s2 eqiad T321391 [18:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321391)', diff saved to https://phabricator.wikimedia.org/P43006 and previous config saved to /var/cache/conftool/dbconfig/20230111-184946-marostegui.json [18:51:20] (03CR) 10Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [18:52:20] !log Removing legacy vips from dns servers - T239993 [18:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:23] T239993: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 [18:52:27] (03PS1) 10Jdlrobson: Enable page tools 
on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879121 [18:52:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:54:34] (03CR) 10Dzahn: "Majavah, I think you were technically a contributor in git log, if you agree then this is 100%" [puppet] - 10https://gerrit.wikimedia.org/r/878205 (owner: 10Dzahn) [18:56:13] (03CR) 10Majavah: [C: 03+1] "don't remember what I did here, but sure" [puppet] - 10https://gerrit.wikimedia.org/r/878205 (owner: 10Dzahn) [18:57:31] !log dbmaint deploy schema change with replication on s3 eqiad T321391 [18:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:35] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [18:57:38] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:59:00] (03CR) 10Dzahn: [C: 03+2] "see https://gerrit.wikimedia.org/r/c/operations/puppet/+/815290 f.e. , thanks" [puppet] - 10https://gerrit.wikimedia.org/r/878205 (owner: 10Dzahn) [19:00:05] jeena and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1900). [19:00:05] jeena and dduvall: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T1900). [19:00:29] brett: multi-merge on puppetmaster, but mine is "add license headers" and yours is just "slightly" more risky with "remove VIP from DNS server".. so it's all yours :o [19:00:48] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:00:54] got it, thanks! [19:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 1%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43007 and previous config saved to /var/cache/conftool/dbconfig/20230111-190111-root.json [19:04:47] (03PS1) 10JMeybohm: admin_ng RBAC: Permit deploy users to interact with more resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/879122 (https://phabricator.wikimedia.org/T324576) [19:04:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P43008 and previous config saved to /var/cache/conftool/dbconfig/20230111-190453-marostegui.json [19:06:10] (03CR) 10JMeybohm: "If this works, we should probably if-guard the other CRD permissions as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/879122 (https://phabricator.wikimedia.org/T324576) (owner: 10JMeybohm) [19:07:23] (03CR) 10Dzahn: "@Papaul, Jennifer is now twice in the admin module, can you please remove her from the "ldap_only" section" [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [19:10:04] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Citoid [19:10:18] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:52] train is blocked, will resume after https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/879094/ has had QA and backport [19:11:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:12:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:13:17] (03CR) 10JMeybohm: [C: 03+2] admin_ng RBAC: Permit deploy users to interact with more resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/879122 (https://phabricator.wikimedia.org/T324576) (owner: 10JMeybohm) [19:15:53] (03PS1) 10Dzahn: librenms: assert data type for globally reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/879123 [19:16:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43009 and previous config saved to /var/cache/conftool/dbconfig/20230111-191616-root.json [19:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:17:58] (03Merged) 10jenkins-bot: admin_ng RBAC: Permit deploy users to interact with more resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/879122 (https://phabricator.wikimedia.org/T324576) (owner: 10JMeybohm) [19:19:18] !log otto@deploy1002 helmfile 
[dse-k8s-eqiad] START helmfile.d/admin 'apply'. [19:19:28] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [19:20:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P43010 and previous config saved to /var/cache/conftool/dbconfig/20230111-192000-marostegui.json [19:20:44] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [19:20:50] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply [19:24:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:24:48] PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:23] (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [19:27:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:29:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:30:23] (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [19:31:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43011 and previous config saved to /var/cache/conftool/dbconfig/20230111-193121-root.json [19:32:23] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/879123/39097/netmon1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/879123 (owner: 10Dzahn) [19:32:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:35:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321391)', diff saved to https://phabricator.wikimedia.org/P43012 and previous config saved to /var/cache/conftool/dbconfig/20230111-193506-marostegui.json [19:35:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [19:35:11] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [19:35:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [19:35:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime 
for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance [19:35:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance [19:36:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T321391)', diff saved to https://phabricator.wikimedia.org/P43013 and previous config saved to /var/cache/conftool/dbconfig/20230111-193601-marostegui.json [19:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:38:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321391)', diff saved to https://phabricator.wikimedia.org/P43014 and previous config saved to /var/cache/conftool/dbconfig/20230111-193825-marostegui.json [19:38:52] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/879123 (owner: 10Dzahn) [19:39:33] (03CR) 10Dzahn: [V: 03+1 C: 03+2] librenms: assert data type for globally reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/879123 (owner: 10Dzahn) [19:41:56] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop on netmon1003" [puppet] - 10https://gerrit.wikimedia.org/r/879123 (owner: 10Dzahn) [19:42:42] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Andrew) via elimination I've convinced myself that the issue here is 10_dumps_rsyncd : ` # Autogener... 
[19:45:40] (03CR) 10Dzahn: "meanwhile I have mailed the affcom list and they confirmed they are working on it - on hold" [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: 10Dzahn) [19:46:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43015 and previous config saved to /var/cache/conftool/dbconfig/20230111-194626-root.json [19:51:04] (03PS3) 10Dzahn: phabricator: rewrite https://phabricator.wikimedia.org/r/ to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/869853 (https://phabricator.wikimedia.org/T324311) [19:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:53:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P43016 and previous config saved to /var/cache/conftool/dbconfig/20230111-195332-marostegui.json [20:01:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43017 and previous config saved to /var/cache/conftool/dbconfig/20230111-200131-root.json [20:02:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:08:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P43018 and previous config saved to 
/var/cache/conftool/dbconfig/20230111-200838-marostegui.json [20:12:24] (03PS1) 10Bartosz Dziewoński: Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879099 (https://phabricator.wikimedia.org/T301063) [20:12:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:16:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43019 and previous config saved to /var/cache/conftool/dbconfig/20230111-201636-root.json [20:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:18:23] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BCornwall) 05Open→03Resolved @ayounsi Thanks for the detailed explanation and the help! I've removed the legacy_vip stuff from puppet, rolled it out, and deleted the IPs from the servers. I've als... 
[20:18:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:19:02] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Andrew) The troublesome entries are: ` ftp.acc.umu.se mirror.accum.se ftp.acc.umu.se mirror.accum.se `... [20:20:03] (03CR) 10Bartosz Dziewoński: "(There was a merge conflict because https://gerrit.wikimedia.org/r/c/mediawiki/core/+/876270 isn't present in wmf.17)" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: 10Bartosz Dziewoński) [20:22:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:23:14] jeena, are you currently busy or may I slide in a config change? 
[20:23:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321391)', diff saved to https://phabricator.wikimedia.org/P43020 and previous config saved to /var/cache/conftool/dbconfig/20230111-202345-marostegui.json [20:23:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [20:23:49] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [20:23:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:23:58] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10RobH) [20:24:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [20:24:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T321391)', diff saved to https://phabricator.wikimedia.org/P43021 and previous config saved to /var/cache/conftool/dbconfig/20230111-202417-marostegui.json [20:26:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321391)', diff saved to https://phabricator.wikimedia.org/P43022 and previous config saved to /var/cache/conftool/dbconfig/20230111-202641-marostegui.json [20:27:42] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Andrew) I don't see any real problem with those hosts other than that they're duplicates of each other.... 
[20:31:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: After being recloned', diff saved to https://phabricator.wikimedia.org/P43023 and previous config saved to /var/cache/conftool/dbconfig/20230111-203141-root.json [20:32:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:36:12] (03CR) 10CI reject: [V: 04-1] Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: 10Bartosz Dziewoński) [20:37:15] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10Papaul) Fixed [20:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:39:17] (03PS1) 10Bartosz Dziewoński: Fix phan error when Excimer is enabled [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879100 [20:39:37] (03PS3) 10Bartosz Dziewoński: Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) [20:39:57] (03CR) 10Bartosz Dziewoński: "(New test failure is unrelated to the change, will be fixed by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/879100)" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: 10Bartosz Dziewoński) [20:40:40] (03PS1) 10Papaul: Remove Jennifer from the 
ldap group [puppet] - 10https://gerrit.wikimedia.org/r/879131 (https://phabricator.wikimedia.org/T326649) [20:41:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P43024 and previous config saved to /var/cache/conftool/dbconfig/20230111-204147-marostegui.json [20:47:01] (03CR) 10Marostegui: [C: 03+1] Remove Jennifer from the ldap group [puppet] - 10https://gerrit.wikimedia.org/r/879131 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [20:47:45] (03CR) 10Marostegui: [C: 03+2] Remove Jennifer from the ldap group [puppet] - 10https://gerrit.wikimedia.org/r/879131 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [20:47:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:50:38] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Dzahn) @Andrew Is it not maybe 65.19.157.35 ? Because that is the only IP in there and it fails to reso... 
[20:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:56:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P43025 and previous config saved to /var/cache/conftool/dbconfig/20230111-205654-marostegui.json [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T2100). [21:00:05] jan_drewniak and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:17] I can deploy [21:01:49] hi [21:01:56] o/ [21:02:24] Great. Give me just a moment. We'll start with jan_drewniak [21:02:33] jan_drewniak: do you want to backport to wmf.17 too, or only wmf.18? i see you have a backport patch but it's not listed on the calendar [21:03:34] MatmaRex: yeah I made two but I think only wmf.18 is required [21:05:47] jan_drewniak: I'm going to sync both of yours at the same time since one is going to wmf.18 and the other to beta. [21:06:05] !log start UTC late backport window [21:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:07] that's fine with me [21:07:29] OK, starting merge. 
[21:07:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:07:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879094 (https://phabricator.wikimedia.org/T326682) (owner: 10Jdrewniak) [21:07:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879121 (owner: 10Jdlrobson) [21:08:03] (03PS6) 10BCornwall: varnish: Template out thread pool settings [puppet] - 10https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) [21:08:32] (03Merged) 10jenkins-bot: Enable page tools on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879121 (owner: 10Jdlrobson) [21:09:16] (03CR) 10BCornwall: varnish: Template out thread pool settings (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [21:09:29] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39098/console" [puppet] - 10https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [21:12:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321391)', diff saved to https://phabricator.wikimedia.org/P43027 and previous config saved to /var/cache/conftool/dbconfig/20230111-211200-marostegui.json [21:12:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [21:12:05] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to 
wmf wikis - https://phabricator.wikimedia.org/T321391 [21:12:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [21:12:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T321391)', diff saved to https://phabricator.wikimedia.org/P43028 and previous config saved to /var/cache/conftool/dbconfig/20230111-211222-marostegui.json [21:12:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:13:19] (03CR) 10Dzahn: [C: 03+1] Remove Jennifer from the ldap group [puppet] - 10https://gerrit.wikimedia.org/r/879131 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [21:14:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321391)', diff saved to https://phabricator.wikimedia.org/P43029 and previous config saved to /var/cache/conftool/dbconfig/20230111-211445-marostegui.json [21:15:57] (03PS1) 10Dzahn: phabricator: add test for /r/ redirect to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/879137 (https://phabricator.wikimedia.org/T324311) [21:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:22:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:23:17] (03Merged) 10jenkins-bot: Fix mustache 
template rendering when TOC is rerendered after an edit [skins/Vector] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879094 (https://phabricator.wikimedia.org/T326682) (owner: 10Jdrewniak) [21:23:43] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:879094|Fix mustache template rendering when TOC is rerendered after an edit (T326682)]], [[gerrit:879121|Enable page tools on beta cluster]] [21:23:47] T326682: [Regression, production] Vector 2022 TOC disappears, becomes "undefined" after saving an edit with DiscussionTools, VisualEditor - https://phabricator.wikimedia.org/T326682 [21:25:27] !log kindrobot@deploy1002 kindrobot and jdrewniak and jdlrobson: Backport for [[gerrit:879094|Fix mustache template rendering when TOC is rerendered after an edit (T326682)]], [[gerrit:879121|Enable page tools on beta cluster]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:25:53] jan_drewniak: could you please confirm your wmf.18 patch? [21:27:22] kindrobot: yup, looks good! [21:27:47] Great. Syncing. [21:28:22] kindrobot: considering that the process seems to be taking a long time today, how about doing my backports all at once? [21:28:54] That's fine with me. 
[21:29:46] note that there's a dependency between the two wmf.17 patches, but i think that's fine [21:29:48] thanks [21:29:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P43030 and previous config saved to /var/cache/conftool/dbconfig/20230111-212952-marostegui.json [21:30:28] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 129 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:32:46] Will you be able to test out your wmf.17 patches as opposed to your wmf.18 patches on the test servers if they're deployed together? [21:33:04] MatmaRex ^ [21:33:38] yeah [21:33:58] we have wikis on both .17 and .18, right? [21:34:01] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:879094|Fix mustache template rendering when TOC is rerendered after an edit (T326682)]], [[gerrit:879121|Enable page tools on beta cluster]] (duration: 10m 17s) [21:34:04] T326682: [Regression, production] Vector 2022 TOC disappears, becomes "undefined" after saving an edit with DiscussionTools, VisualEditor - https://phabricator.wikimedia.org/T326682 [21:34:13] group0 is on .18 [21:34:24] Ah, OK. [21:34:31] i don't need to make any edits to test these, so i can just test on wikipedias [21:35:39] OK, MatmaRex. I'm getting ready to start your merges. 
[21:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:38:20] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 133 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:38:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/878154 (owner: 10Bartosz Dziewoński) [21:38:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879100 (owner: 10Bartosz Dziewoński) [21:38:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: 10Bartosz Dziewoński) [21:38:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879099 (https://phabricator.wikimedia.org/T301063) (owner: 10Bartosz Dziewoński) [21:39:06] btw you should be able to just provide all the change numbers to scap backport for merging/deploy [21:39:35] oh as you did :P [21:40:08] :) [21:44:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P43031 and previous config saved to /var/cache/conftool/dbconfig/20230111-214458-marostegui.json [21:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:57:12] (03Merged) 10jenkins-bot: Fix exception in `` with missing images [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/878154 (owner: 10Bartosz Dziewoński) [21:57:18] (03Merged) 10jenkins-bot: Fix phan error when Excimer is enabled [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879100 (owner: 10Bartosz Dziewoński) [21:57:27] (03Merged) 10jenkins-bot: Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879098 (https://phabricator.wikimedia.org/T301063) (owner: 10Bartosz Dziewoński) [21:57:31] (03Merged) 10jenkins-bot: Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879099 (https://phabricator.wikimedia.org/T301063) (owner: 10Bartosz Dziewoński) [21:58:01] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:878154|Fix exception in `` with missing images]], [[gerrit:879100|Fix phan error when Excimer is enabled]], [[gerrit:879098|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T301063 T326399)]], [[gerrit:879099|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T301063 [21:58:01] T326399)]] [21:58:05] T301063: The "tag name" on the change line should link directly to "tagged changes" - https://phabricator.wikimedia.org/T301063 [21:58:06] T326399: (other edits) links repetitive and long - https://phabricator.wikimedia.org/T326399 [21:58:12] i can test things whenever they're on mwdebug [21:58:42] Great. Should be soon. I'll ping you. 
[22:00:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321391)', diff saved to https://phabricator.wikimedia.org/P43033 and previous config saved to /var/cache/conftool/dbconfig/20230111-220005-marostegui.json [22:00:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [22:00:10] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [22:00:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [22:00:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43034 and previous config saved to /var/cache/conftool/dbconfig/20230111-220026-marostegui.json [22:01:56] (03Abandoned) 10Bartosz Dziewoński: Fix mustache template rendering when TOC is rerendered after an edit [skins/Vector] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/879093 (https://phabricator.wikimedia.org/T326682) (owner: 10Jdrewniak) [22:02:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43035 and previous config saved to /var/cache/conftool/dbconfig/20230111-220251-marostegui.json [22:03:35] 10SRE, 10Infrastructure-Foundations, 10serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) Alright, finally getting back to this. So the request is that the group "deployment", which is already on the canary_appserver role on mwdebug hosts... [22:07:28] Sorry it's taking so long. Not sure what's holding it up. [22:08:12] what's the last output you got? [22:10:43] jeena: sorry my tmux session got weird [22:11:03] It's actually making more progress. 
[22:11:08] oh good [22:11:19] It was on K8s image build/push [22:11:29] Now it's on sync-masters [22:11:32] ah yeah that can take a while [22:12:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:13:01] Does this link work for yall: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.7.0-5-2023.02?id=WyXhooUBPP0fLdos6gAX [22:13:46] (03PS1) 10Dzahn: admin/canary_appserver: add group of users allowed to disable puppet [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) [22:13:46] zabe: sorry I did not see your message until now for some reason. If you still want to add your config change after backports are done and before I deploy today that would be fine with me [22:14:02] Yes, I can see it dancy. 
[22:14:03] yeah that would be cool [22:14:11] (03CR) 10CI reject: [V: 04-1] admin/canary_appserver: add group of users allowed to disable puppet [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [22:14:20] dancy, works for me [22:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [22:15:42] thx [22:16:59] (03PS2) 10Dzahn: admin/canary_appserver: add group of users allowed to disable puppet [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) [22:17:18] https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.7.0-5-2023.02?id=MyHYooUBPP0fLdoscyIC [22:17:33] 476 l10n files rebuilt. [22:17:37] (03CR) 10CI reject: [V: 04-1] admin/canary_appserver: add group of users allowed to disable puppet [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [22:17:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P43036 and previous config saved to /var/cache/conftool/dbconfig/20230111-221757-marostegui.json [22:17:59] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) @daniel This was for you, remember that? [22:18:54] Looks like it was a fresh checkout of wmf.17? Jeena does that track? [22:19:14] oh? that seems weird [22:19:16] oh I may be misinterpreting a message.. disregard. [22:19:29] ah yes, it always says "successfully checked out" nvm. 
[22:19:36] phew lol [22:19:39] (03PS3) 10Dzahn: admin/canary_appserver: add group of users allowed to disable puppet [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) [22:21:24] (03PS1) 10Zabe: Start writing to rev_comment_id on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954) [22:21:37] !log kindrobot@deploy1002 kindrobot and matmarex: Backport for [[gerrit:878154|Fix exception in `` with missing images]], [[gerrit:879100|Fix phan error when Excimer is enabled]], [[gerrit:879098|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T301063 T326399)]], [[gerrit:879099|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view [22:21:38] " (T301063 T326399)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [22:21:42] T301063: The "tag name" on the change line should link directly to "tagged changes" - https://phabricator.wikimedia.org/T301063 [22:21:42] T326399: (other edits) links repetitive and long - https://phabricator.wikimedia.org/T326399 [22:22:07] MatmaRex: we made it! Could you please confirm? [22:22:14] hey deployers, have you ever thought "I wish I could temp disable puppet on mwdebug" ? [22:22:21] looking [22:22:22] I know I have been asked about it [22:22:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:23:01] kindrobot: everything looks good [22:23:34] OK great. Syncing. [22:23:59] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/879098 changes l10n files.. so it's when that was backported that I would expect to see the l10n rebuild happen. 
[22:24:05] mutante: are you making that happen? :P personally I have not considered it but I am probably an outlier [22:24:30] jeena: yes:) I am trying to make that happen. https://gerrit.wikimedia.org/r/c/operations/puppet/+/879147 [22:24:42] a ticket that's been sitting there for a while [22:25:20] cool! [22:26:32] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) [22:27:49] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) [22:27:58] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) >>! In T305979#7976119, @MoritzMuehlenhoff wrote: > This was discussed in the Infrastructure Foundation... [22:28:02] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) a:05Dzahn→03None [22:29:43] (03CR) 10Dzahn: "Alex, Effie, I almost forgot entirely about this. 
Does it make sense to keep it open or is this one of those cases where thumbor is moving" [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [22:31:10] 10SRE, 10Icinga, 10Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (10lmata) [22:31:38] (03CR) 10Dzahn: phabricator: change phd home dir to /var/lib/phd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [22:32:46] PROBLEM - Check systemd state on people2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:32:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:33:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P43037 and previous config saved to /var/cache/conftool/dbconfig/20230111-223304-marostegui.json [22:35:08] How did this person make this video? Is the camera strapped to their head? 
[22:37:33] Ooops, wrong channel x_x;; [22:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:38:07] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:878154|Fix exception in `` with missing images]], [[gerrit:879100|Fix phan error when Excimer is enabled]], [[gerrit:879098|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T301063 T326399)]], [[gerrit:879099|Revert "ChangeTags: When showing a tag, also link to a filtered RecentChanges view" (T30106 [22:38:07] 3 T326399)]] (duration: 40m 05s) [22:38:09] I'm not watching youtube while the deploy is finishing. ;) [22:38:13] T301063: The "tag name" on the change line should link directly to "tagged changes" - https://phabricator.wikimedia.org/T301063 [22:38:13] T326399: (other edits) links repetitive and long - https://phabricator.wikimedia.org/T326399 [22:38:16] T30106: Problem with port setting when web server is behind NAT - https://phabricator.wikimedia.org/T30106 [22:38:57] Speaking of, the deploy just finished. Thank you jan_drewniak and MatmaRex. Sorry that took so long. [22:39:07] (03PS5) 10Dzahn: phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [22:39:10] thanks! 
[22:39:10] !log close UTC late backport window [22:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:27] (03CR) 10CI reject: [V: 04-1] phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [22:39:35] (03CR) 10Dzahn: phabricator: change phd home dir to /var/lib/phd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [22:40:01] jeena, I would quickly push through my config changes if that is still fine with this [22:40:18] Yup, lmk if you need anything from me zabe [22:40:28] (03CR) 10Zabe: [C: 03+2] Start reading from cuc_actor on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879055 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:40:36] (03PS2) 10Zabe: Start writing to rev_comment_id on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954) [22:40:41] (03CR) 10Zabe: [C: 03+2] Start writing to rev_comment_id on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [22:40:43] !log upload memkeys_20181031-2~bullseye0_ on bullseye-wikimedia [22:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:48] (03PS2) 10Zabe: Stop writing to cul_user and cul_user_text on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004) [22:40:53] (03CR) 10Zabe: [C: 03+2] Stop writing to cul_user and cul_user_text on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:41:14] (03Merged) 10jenkins-bot: Start reading from cuc_actor on group0 and group1 wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/879055 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:41:27] (03PS6) 10Dzahn: phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [22:41:29] (03Merged) 10jenkins-bot: Start writing to rev_comment_id on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [22:41:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879148 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [22:41:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:41:35] (03Merged) 10jenkins-bot: Stop writing to cul_user and cul_user_text on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879057 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:41:52] (03CR) 10CI reject: [V: 04-1] phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [22:42:12] !log zabe@deploy1002 Started scap: Backport for [[gerrit:879055|Start reading from cuc_actor on group0 and group1 wikis (T233004)]], [[gerrit:879148|Start writing to rev_comment_id on group0 wikis (T299954)]], [[gerrit:879057|Stop writing to cul_user and cul_user_text on testwiki (T233004)]] [22:42:17] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [22:42:18] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:43:58] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:879055|Start reading from cuc_actor on group0 and 
group1 wikis (T233004)]], [[gerrit:879148|Start writing to rev_comment_id on group0 wikis (T299954)]], [[gerrit:879057|Stop writing to cul_user and cul_user_text on testwiki (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [22:44:06] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:48:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43038 and previous config saved to /var/cache/conftool/dbconfig/20230111-224810-marostegui.json [22:48:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [22:48:15] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [22:48:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [22:48:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43039 and previous config saved to /var/cache/conftool/dbconfig/20230111-224832-marostegui.json [22:50:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43040 and previous config saved to /var/cache/conftool/dbconfig/20230111-225056-marostegui.json [22:51:40] !log 
zabe@deploy1002 Finished scap: Backport for [[gerrit:879055|Start reading from cuc_actor on group0 and group1 wikis (T233004)]], [[gerrit:879148|Start writing to rev_comment_id on group0 wikis (T299954)]], [[gerrit:879057|Stop writing to cul_user and cul_user_text on testwiki (T233004)]] (duration: 09m 28s) [22:51:44] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [22:51:44] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:52:13] jeena, over to you [22:52:23] Thanks zabe [22:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:56:08] (03PS7) 10Dzahn: phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [23:02:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:05:31] jouncebot: now [23:05:31] No deployments scheduled for the next 7 hour(s) and 54 minute(s) [23:06:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P43041 and previous config saved to /var/cache/conftool/dbconfig/20230111-230603-marostegui.json [23:07:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q3:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10RobH) [23:07:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q3:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 
- https://phabricator.wikimedia.org/T326767 (10RobH) [23:07:51] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879152 (https://phabricator.wikimedia.org/T325581) [23:07:53] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879152 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot) [23:08:33] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879152 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot) [23:15:55] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.18 refs T325581 [23:15:58] T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581 [23:21:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P43042 and previous config saved to /var/cache/conftool/dbconfig/20230111-232109-marostegui.json [23:21:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10RobH) [23:22:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:22:52] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.18 refs T325581 (duration: 06m 57s) [23:22:56] T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581 [23:32:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:36:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43043 and previous config saved to /var/cache/conftool/dbconfig/20230111-233616-marostegui.json [23:36:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [23:36:21] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [23:36:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [23:36:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [23:36:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [23:36:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T321391)', diff saved to https://phabricator.wikimedia.org/P43044 and previous config saved to /var/cache/conftool/dbconfig/20230111-233652-marostegui.json [23:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:39:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T321391)', diff saved to https://phabricator.wikimedia.org/P43045 and previous config saved to /var/cache/conftool/dbconfig/20230111-233916-marostegui.json [23:47:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:53:42] (03PS1) 10Bartosz Dziewoński: Track callers of parseRevisionParsoidHtml. [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879101 [23:54:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P43047 and previous config saved to /var/cache/conftool/dbconfig/20230111-235423-marostegui.json