[00:02:51] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:18] SRE, Alerting, Icinga, User-CDanis: CLI script for manual paging - https://phabricator.wikimedia.org/T82937 (lmata)
[00:25:49] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:03] SRE, Alerting, Epic: Monitor and alarm on SMART attributes [tracking] - https://phabricator.wikimedia.org/T86552 (lmata)
[00:41:32] SRE, Metrics, observability, Graphite, Performance-Team (Radar): Improve graphite failover - https://phabricator.wikimedia.org/T88997 (lmata)
[00:57:15] SRE, Logging, Wikimedia-Mailing-lists, Patch-For-Review: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (lmata)
[00:59:28] SRE, Infrastructure-Foundations, Mail, Metrics, Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (lmata)
[01:01:17] SRE, Metrics, observability, Graphite, Performance-Team (Radar): Improve graphite failover - https://phabricator.wikimedia.org/T88997 (lmata)
[01:01:32] SRE, Metrics, observability, Documentation, Graphite: document graphite failover/backfill procedures - https://phabricator.wikimedia.org/T102575 (lmata)
[01:02:19] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:00] SRE, Alerting: Monitoring: add link to graph for Icinga timeseries alarms - https://phabricator.wikimedia.org/T167422 (lmata)
[01:03:13] SRE, Metrics: Aggregate prometheus functions yielding different results in grafana vs. prometheus console - https://phabricator.wikimedia.org/T168403 (lmata)
[01:03:39] SRE, Logging, observability, Graphite: unused grafana-dashboard indices on elasticsearch / logstash - https://phabricator.wikimedia.org/T174172 (lmata)
[01:07:45] SRE, Graphite, Patch-For-Review, SRE Observability (FY2021/2022-Q1), User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (lmata)
[01:10:03] SRE, Metrics: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (lmata)
[01:10:17] SRE, Metrics, User-CDanis, User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (lmata)
[01:10:32] SRE, Logging, Wikimedia-Logstash, observability, and 2 others: Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (lmata)
[01:11:09] SRE, Logging, Wikimedia-Logstash, observability, and 2 others: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (lmata)
[01:11:21] SRE, Metrics, observability, Graphite: Include ADD operation in memcached stats and grafana dashboard - https://phabricator.wikimedia.org/T201016 (lmata)
[01:11:43] SRE, Alerting, Patch-For-Review, Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (lmata)
[01:11:51] SRE, Logging, Wikimedia-Logstash, observability, User-herron: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173 (lmata)
[01:12:11] SRE, Analytics-Radar, Logging, Wikimedia-Logstash, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (lmata)
[01:12:33] SRE, Metrics, Performance-Team (Radar): "Workers" data from prometheus for mw app servers alternates strangely - https://phabricator.wikimedia.org/T206939 (lmata)
[01:12:49] SRE, Metrics, observability, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (lmata)
[01:12:57] SRE, Metrics, observability, Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (lmata)
[01:13:23] SRE, Logging, Wikimedia-Logstash, observability, Patch-For-Review: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (lmata)
[01:15:49] SRE, Alerting, Icinga, observability: Aggregate Proton, Restbase and mobileapps icinga alerts - https://phabricator.wikimedia.org/T250017 (lmata)
[01:16:13] SRE, Infrastructure-Foundations, Metrics, netops: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155 (lmata)
[01:16:23] SRE, Infrastructure-Foundations, Metrics, netops: add traceroute measurements to RIPE Atlas prometheus data - https://phabricator.wikimedia.org/T251156 (lmata)
[01:16:58] SRE, Metrics, Traffic, Performance-Team (Radar), Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (lmata)
[01:17:16] SRE, Alerting: check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (lmata)
[01:19:05] SRE, Alerting, Icinga, User-CDanis: CLI script for manual paging - https://phabricator.wikimedia.org/T82937 (lmata)
[01:19:48] SRE, Metrics, observability, Graphite: grafana access control - https://phabricator.wikimedia.org/T108546 (lmata)
[01:20:16] SRE, Logging, Wikimedia-Logstash, observability: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849 (lmata)
[01:21:23] SRE, Logging, MediaWiki-Debug-Logger, Wikimedia-Logstash, and 2 others: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989 (lmata)
[01:26:53] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:52] SRE, Logging, Wikimedia-Logstash, observability, Patch-For-Review: Move iegreview from udp2log to syslog - https://phabricator.wikimedia.org/T215497 (lmata)
[01:47:17] SRE, Logging, Wikimedia-Logstash, observability: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (lmata)
[01:47:51] SRE, Metrics: WMF's Grafana installation does not follow Wikimedia's visual identity guidelines - https://phabricator.wikimedia.org/T214762 (lmata)
[01:48:09] SRE, Metrics, WMF-Legal, observability, and 2 others: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (lmata)
[01:48:16] SRE, Metrics: Update prometheus-node-exporter NTP metrics - https://phabricator.wikimedia.org/T208875 (lmata)
[01:48:43] SRE, Logging, Wikimedia-Logstash, observability: Rationalize default logrotate "rotated" file extensions - https://phabricator.wikimedia.org/T207296 (lmata)
[01:50:25] SRE, Logging, Wikimedia-Logstash, observability, and 2 others: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (lmata)
[02:02:45] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:27] SRE, Metrics, User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (lmata)
[02:30:44] SRE, Discovery-Search, Logging, Patch-For-Review: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (lmata)
[02:31:13] SRE, Metrics: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (lmata)
[02:31:37] SRE, Data-Persistence-Backup, Metrics, media-backups, and 2 others: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (lmata)
[02:32:27] SRE, Discovery-Search, Elasticsearch, Logging, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (lmata)
[02:32:53] SRE, Metrics, Sustainability (Incident Followup): prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc) - https://phabricator.wikimedia.org/T222102 (lmata)
[02:33:10] SRE, Metrics, observability, Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (lmata)
[02:33:48] SRE, Metrics, observability, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (lmata)
[02:36:12] SRE, Logging, Wikimedia-Logstash, observability, Patch-For-Review: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (lmata)
[02:36:29] SRE, Alerting, Icinga, observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (lmata)
[02:38:16] SRE, Logging, Wikimedia-Logstash, observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (lmata)
[02:38:39] SRE, Alerting, Infrastructure-Foundations, Mail, User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (lmata)
[02:40:15] SRE, Logging, serviceops, Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (lmata)
[02:40:20] SRE, Metrics, User-fgiunchedi: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954 (lmata)
[02:41:18] SRE, Alerting: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (lmata)
[02:41:47] SRE, Metrics, Upstream: Grafana error: "parse error at char 1: unexpected character: '\\ufeff'" when copy-pasting metric names - https://phabricator.wikimedia.org/T263624 (lmata)
[02:44:20] SRE, Metrics: Add slabinfo prometheus exporter - https://phabricator.wikimedia.org/T160071 (lmata)
[02:47:08] SRE, Alerting, Icinga, observability, serviceops: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (lmata)
[02:47:51] SRE, Metrics, observability, Graphite: unused grafana-dashboard indices on elasticsearch / logstash - https://phabricator.wikimedia.org/T174172 (lmata)
[02:48:06] SRE, Elasticsearch, Logging, Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (lmata)
[02:48:28] SRE, Logging, Wikimedia-Logstash, observability, and 2 others: Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (lmata)
[02:48:38] SRE, Logging, Wikimedia-Logstash, observability, and 2 others: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (lmata)
[02:48:45] SRE, Logging, Wikimedia-Logstash, observability, User-herron: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173 (lmata)
[02:49:08] SRE, Metrics, observability, Graphite: Include ADD operation in memcached stats and grafana dashboard - https://phabricator.wikimedia.org/T201016 (lmata)
[02:51:24] SRE, Analytics-Radar, Logging, Wikimedia-Logstash, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (lmata)
[02:51:31] SRE, Metrics, User-fgiunchedi: Review prometheus_nodes params - https://phabricator.wikimedia.org/T207292 (lmata)
[02:51:41] SRE, Logging, Wikimedia-Logstash, observability: Rationalize default logrotate "rotated" file extensions - https://phabricator.wikimedia.org/T207296 (lmata)
[02:51:51] SRE, Logging, Wikimedia-Logstash, observability, and 2 others: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (lmata)
[02:51:59] SRE, Metrics, WMF-Legal, observability, and 2 others: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (lmata)
[02:52:08] SRE, Logging, Wikimedia-Logstash, observability, Patch-For-Review: Move iegreview from udp2log to syslog - https://phabricator.wikimedia.org/T215497 (lmata)
[02:53:07] SRE, Logging, Wikimedia-Logstash, observability: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (lmata)
[02:53:15] (PS1) BryanDavis: toolhub: Add CronJob for crawer [deployment-charts] - https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405)
[02:54:03] SRE, Alerting, Icinga, User-CDanis: CLI script for manual paging - https://phabricator.wikimedia.org/T82937 (lmata)
[02:54:18] SRE, Metrics, observability, Graphite, Performance-Team (Radar): Improve graphite failover - https://phabricator.wikimedia.org/T88997 (lmata)
[02:55:50] SRE, Metrics, observability, Documentation, Graphite: document graphite failover/backfill procedures - https://phabricator.wikimedia.org/T102575 (lmata)
[02:58:49] SRE, Logging, Wikimedia-Logstash, observability: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (lmata)
[03:00:12] (CR) BryanDavis: "The easiest way to test the job created by the scheduler without waiting for the wall clock to fire it seems to be manually creating one:" [deployment-charts] - https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: BryanDavis)
[03:02:21] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:03:24] SRE, Logging, Privacy Engineering, Wikimedia-Logstash, and 2 others: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (lmata)
[03:06:27] SRE, Logging, Wikimedia-Logstash, observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (lmata)
[03:07:09] SRE, Logging, Wikimedia-Logstash, observability, Patch-For-Review: elk7: fields indexed without position data; cannot run PhraseQuery - https://phabricator.wikimedia.org/T248400 (lmata)
[03:08:00] SRE, Citoid, Logging, Wikimedia-Logstash, and 3 others: Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (lmata)
[03:09:57] SRE, SRE Observability (FY2021/2022-Q1): Icinga refresh hardware selection (2020) - https://phabricator.wikimedia.org/T251644 (lmata)
[03:16:23] SRE, Logging, MW-on-K8s, serviceops: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (lmata)
[03:17:03] SRE, Logging, Wikimedia-Logstash, observability: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849 (lmata)
[03:17:27] SRE, Alerting: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (lmata)
[03:17:59] SRE, Metrics, observability, Graphite: grafana access control - https://phabricator.wikimedia.org/T108546 (lmata)
[03:19:18] SRE, Logging, MediaWiki-Debug-Logger, Wikimedia-Logstash, and 2 others: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989 (lmata)
[03:19:29] SRE, Metrics, observability, Graphite: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821 (lmata)
[03:22:08] SRE, Discovery-Search, Elasticsearch, Metrics, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (lmata)
[03:22:27] SRE, Metrics, observability, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (lmata)
[03:23:26] SRE, Logging, Wikimedia-Logstash, observability, Patch-For-Review: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (lmata)
[03:24:13] SRE, Logging, serviceops: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (lmata)
[03:26:58] SRE, Logging, Wikimedia-Logstash, observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (lmata)
[03:27:15] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:53] SRE, Logging: Monitor the BMC's event log for hardware errors - https://phabricator.wikimedia.org/T136311 (lmata)
[03:31:36] SRE, Infrastructure-Foundations, Metrics, CAS-SSO, and 3 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (lmata)
[03:32:41] SRE, Alerting: librenms page didn't auto-resolve in VO - https://phabricator.wikimedia.org/T263423 (lmata)
[03:33:33] SRE, Alerting: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643 (lmata)
[03:34:26] SRE, Metrics, observability, Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (lmata)
[03:35:20] SRE, SRE-tools, Alerting, Icinga, and 2 others: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (lmata)
[03:35:34] SRE, Alerting: automation: issue reminders for about-to-expire downtimes - https://phabricator.wikimedia.org/T230633 (lmata)
[03:37:55] Puppet, SRE, Alerting, Infrastructure-Foundations: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (lmata)
[03:38:44] SRE, Infrastructure-Foundations, Logging, netops: Provision plaintext syslog collectors in esams/ulsfo/eqsin - https://phabricator.wikimedia.org/T243065 (lmata)
[03:39:32] SRE, Metrics, User-fgiunchedi: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434 (lmata)
[03:43:06] SRE, Alerting, Icinga, observability: Icinga notifications didn't get applied after a puppet run - https://phabricator.wikimedia.org/T251407 (lmata)
[03:43:39] SRE, Alerting, Icinga, observability: Aggregate Proton, Restbase and mobileapps icinga alerts - https://phabricator.wikimedia.org/T250017 (lmata)
[03:56:20] (PS1) KartikMistry: Update cxserver to 2021-08-06-062053-production [deployment-charts] - https://gerrit.wikimedia.org/r/710727 (https://phabricator.wikimedia.org/T288272)
[04:01:43] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:41] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:13] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:51] (PS1) Marostegui: db1104: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/710729 (https://phabricator.wikimedia.org/T286226)
[05:20:52] (CR) Marostegui: [C: +2] db1104: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/710729 (https://phabricator.wikimedia.org/T286226) (owner: Marostegui)
[05:21:13] (PS2) Marostegui: wmnet: Update s2-master CNAME [dns] - https://gerrit.wikimedia.org/r/710517 (https://phabricator.wikimedia.org/T287454)
[05:21:23] (PS2) Marostegui: mariadb: Promote db2104 to s2 master [puppet] - https://gerrit.wikimedia.org/r/710516 (https://phabricator.wikimedia.org/T287454)
[05:22:58] !log Optimize commonswiki.image on eqiad, lag will appear - T288273
[05:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:08] T288273: Please optimize image table in commonswiki - https://phabricator.wikimedia.org/T288273
[05:23:28] !log Lag in s4 (commonswiki) will appear on clouddb* hosts (wiki replicas) T288273
[05:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:11] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:53:48] SRE, Security-Team, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T288427 (Vishu_aggarwal) p:Triage→Medium
[05:56:16] !log enable cloudsw1-c8 interfaces toward cloudsw2-c8 - T277340
[05:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:25] T277340: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340
[06:00:15] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (ayounsi) cloudsw2-c8 is ready to receive servers. @Jclark-ctr please let us know when cloudsw2-d5 is ready for Netops, and @cmooney will take care of configuring it.
[06:01:05] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:17:08] SRE, Infrastructure-Foundations, netops: Lumen eqiad-codfw link down - https://phabricator.wikimedia.org/T288218 (ayounsi) Open→Resolved Back up Friday 6th, around 10am. Reason was fibercut due to a fire.
[06:25:45] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:49] SRE, Security-Team, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T288427 (RhinosF1) Disabled the account but no idea how to bypass the edit policy and close it. (Or if that's possible without database fiddling)
[06:37:08] * kart__ updating cxserver..
[06:38:08] (CR) KartikMistry: [C: +2] Update cxserver to 2021-08-06-062053-production [deployment-charts] - https://gerrit.wikimedia.org/r/710727 (https://phabricator.wikimedia.org/T288272) (owner: KartikMistry)
[06:40:47] (Merged) jenkins-bot: Update cxserver to 2021-08-06-062053-production [deployment-charts] - https://gerrit.wikimedia.org/r/710727 (https://phabricator.wikimedia.org/T288272) (owner: KartikMistry)
[06:45:19] !log kartik@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[06:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:06] (PS1) Marostegui: mariadb: Move db1107 to m3. [puppet] - https://gerrit.wikimedia.org/r/710916 (https://phabricator.wikimedia.org/T288197)
[06:51:08] (CR) Marostegui: [C: +2] mariadb: Move db1107 to m3. [puppet] - https://gerrit.wikimedia.org/r/710916 (https://phabricator.wikimedia.org/T288197) (owner: Marostegui)
[06:53:07] !log kartik@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[06:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:21] (CR) Giuseppe Lavagetto: "The change per-se does the right thing, assuming no reference to s10 is contained in any other object." [puppet] - https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: RhinosF1)
[07:01:33] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:01:51] (CR) Giuseppe Lavagetto: [C: +1] sre.switchdc.services: Exclude helm-charts, lacking a service IP [cookbooks] - https://gerrit.wikimedia.org/r/710235 (https://phabricator.wikimedia.org/T285707) (owner: Legoktm)
[07:02:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1107.eqiad.wmnet with reason: REIMAGE
[07:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:37] (CR) Giuseppe Lavagetto: [C: +1] shellbox: Disable php-fpm slowlog [deployment-charts] - https://gerrit.wikimedia.org/r/710607 (https://phabricator.wikimedia.org/T288315) (owner: Legoktm)
[07:04:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1107.eqiad.wmnet with reason: REIMAGE
[07:04:46] (CR) Giuseppe Lavagetto: [C: -1] "excimer is installed by the cli image, so it's already available. We should add php-wmerrors though." [docker-images/production-images] - https://gerrit.wikimedia.org/r/710621 (https://phabricator.wikimedia.org/T285309) (owner: Ahmon Dancy)
[07:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:48] !log Updated cxserver to 2021-08-06-062053-production (T288272)
[07:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:54] T288272: Elia: ca->es pair does not work - https://phabricator.wikimedia.org/T288272
[07:07:06] (CR) JMeybohm: "Complete PCC run: https://puppet-compiler.wmflabs.org/compiler1003/30514/" [puppet] - https://gerrit.wikimedia.org/r/710566 (https://phabricator.wikimedia.org/T288345) (owner: JMeybohm)
[07:15:31] !log Stop db1117:3323 to clone db1107 - T288197
[07:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:39] T288197: Failover m3 (phabricator) master (db1132) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T288197
[07:17:58] ema: Good luck ^!
[07:18:13] thank you marostegui!
[07:23:53] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:25:46] (PS1) Ladsgroup: Enable shellbox for constraint for all of wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/710919 (https://phabricator.wikimedia.org/T176312)
[07:28:01] (CR) Ladsgroup: [C: +2] Enable shellbox for constraint for all of wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/710919 (https://phabricator.wikimedia.org/T176312) (owner: Ladsgroup)
[07:28:59] (Merged) jenkins-bot: Enable shellbox for constraint for all of wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/710919 (https://phabricator.wikimedia.org/T176312) (owner: Ladsgroup)
[07:30:19] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710919|Enable shellbox for constraint for all of wikidata (T176312)]] (duration: 00m 58s)
[07:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:28] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - https://phabricator.wikimedia.org/T176312
[07:42:00] (CR) JMeybohm: [C: -1] Add the Kubeflow storage initializer docker image (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[07:45:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1160 T288273', diff saved to https://phabricator.wikimedia.org/P16971 and previous config saved to /var/cache/conftool/dbconfig/20210809-075212-marostegui.json
[07:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:20] T288273: Please optimize image table in commonswiki - https://phabricator.wikimedia.org/T288273
[07:53:52] (CR) Giuseppe Lavagetto: "I have one question specifically regarding the init_db job, otherwise LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: BryanDavis)
[07:56:35] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:56:54] ^ me
[07:57:10] (PS1) ArielGlenn: If a run is interrupted weirdly, the runsettings file may be empty, handle this [dumps] - https://gerrit.wikimedia.org/r/710922 (https://phabricator.wikimedia.org/T288192)
[08:00:18] (CR) ArielGlenn: [C: +2] If a run is interrupted weirdly, the runsettings file may be empty, handle this [dumps] - https://gerrit.wikimedia.org/r/710922 (https://phabricator.wikimedia.org/T288192) (owner: ArielGlenn)
[08:00:21] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:00:44] (Merged) jenkins-bot: If a run is interrupted weirdly, the runsettings file may be empty, handle this [dumps] - https://gerrit.wikimedia.org/r/710922 (https://phabricator.wikimedia.org/T288192) (owner: ArielGlenn)
[08:01:03] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:06] (PS1) ArielGlenn: fix a few pylint whines in runnerutils since I was in there [dumps] - https://gerrit.wikimedia.org/r/710923
[08:01:54] (CR) ArielGlenn: [C: +2] fix a few pylint whines in runnerutils since I was in there [dumps] - https://gerrit.wikimedia.org/r/710923 (owner: ArielGlenn)
[08:02:33] (Merged) jenkins-bot: fix a few pylint whines in runnerutils since I was in there [dumps] - https://gerrit.wikimedia.org/r/710923 (owner: ArielGlenn)
[08:03:18] (CR) Giuseppe Lavagetto: "LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: BryanDavis)
[08:03:27] (CR) Giuseppe Lavagetto: [C: +1] toolhub: Add CronJob for crawer [deployment-charts] - https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: BryanDavis)
[08:03:43] !log ariel@deploy1002 Started deploy [dumps/dumps@142e91c]: fix for T288192 runnerutils bug
[08:03:46] !log ariel@deploy1002 Finished deploy [dumps/dumps@142e91c]: fix for T288192 runnerutils bug (duration: 00m 03s)
[08:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:51] T288192: dumps of small wikis are hanging since four days - https://phabricator.wikimedia.org/T288192
[08:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:25] (PS2) Jcrespo: dbbackups: Reimage dbprov2002 to buster [puppet] - https://gerrit.wikimedia.org/r/708736 (https://phabricator.wikimedia.org/T287230)
[08:10:54] (CR) Jcrespo: [C: +2] dbbackups: Reimage dbprov2002 to buster [puppet] - https://gerrit.wikimedia.org/r/708736 (https://phabricator.wikimedia.org/T287230) (owner: Jcrespo)
[08:13:45] (CR) Marostegui: [C: +1] Conftool-sections: farewell s10 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: RhinosF1)
[08:18:07] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:20:35] (PS2) Jcrespo: dbbackups: Reorganize backups after dbprov2002 reimage [puppet] - https://gerrit.wikimedia.org/r/708737 (https://phabricator.wikimedia.org/T287230)
[08:24:14] !log Upgrade db1117 (all sections) to 10.4.19
[08:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:43] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:57] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (jcrespo) I am extending the downtime for a week from now so it doesn't alert while shutdown.
[08:34:09] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2002.codfw.wmnet with reason: REIMAGE
[08:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:01] SRE, Data-Persistence-Backup, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (fgiunchedi)
[08:36:07] SRE, Data-Persistence-Backup, Metrics, media-backups, and 2 others: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (fgiunchedi)
[08:36:25] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2002.codfw.wmnet with reason: REIMAGE
[08:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:19] (PS1) Ladsgroup: Increase post edit constraint jobs to 85% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/710925 (https://phabricator.wikimedia.org/T204031)
[08:39:03] (PS1) JMeybohm: admin_ng: Add a new PSP for MediaWiki and allow to use it [deployment-charts] - https://gerrit.wikimedia.org/r/710926 (https://phabricator.wikimedia.org/T288315)
[08:41:45] !log upgrade prometheus on prometheus1004 - T222113
[08:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:52] T222113: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113
[08:46:21] !log upgrade prometheus on prometheus2004 - T222113
[08:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:25] (CR) Filippo Giunchedi: "LGTM overall, couple of roles are missing:" [puppet] - https://gerrit.wikimedia.org/r/710617 (owner: Cwhite)
[08:54:07] (CR) Jcrespo: [C: +2] dbbackups: Reorganize backups after dbprov2002 reimage [puppet] - https://gerrit.wikimedia.org/r/708737 (https://phabricator.wikimedia.org/T287230) (owner: Jcrespo)
[09:02:15] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:03:43] SRE, Data-Persistence-Backup, Metrics, media-backups, and 2 others: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (fgiunchedi) Open→Resolved a:fgiunchedi This is complete!
`2.24.1+ds-1+wmf1` is running in production [09:08:57] (03PS1) 10Btullis: Bring an-druid1003.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710927 (https://phabricator.wikimedia.org/T255148) [09:13:24] (03CR) 10JMeybohm: admin_ng: Add a new PSP for MediaWiki and allow to use it (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/710926 (https://phabricator.wikimedia.org/T288315) (owner: 10JMeybohm) [09:15:12] (03PS3) 10JMeybohm: kubernetes::node: Add node.kubernetes.io/disk-type annotation [puppet] - 10https://gerrit.wikimedia.org/r/710566 (https://phabricator.wikimedia.org/T288345) [09:17:24] (03CR) 10jerkins-bot: [V: 04-1] kubernetes::node: Add node.kubernetes.io/disk-type annotation [puppet] - 10https://gerrit.wikimedia.org/r/710566 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [09:17:35] (03CR) 10Hnowlan: [C: 03+2] maps: reenable tilerator on maps2005 [puppet] - 10https://gerrit.wikimedia.org/r/710591 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [09:19:53] (03PS2) 10Hnowlan: maps: reimage maps1005 as buster imposm replica [puppet] - 10https://gerrit.wikimedia.org/r/710582 (https://phabricator.wikimedia.org/T269582) [09:22:45] (03PS1) 10Filippo Giunchedi: Revert "netops: temporarily skip externallabels in alerts" [alerts] - 10https://gerrit.wikimedia.org/r/710928 [09:24:20] (03CR) 10Hnowlan: [C: 03+2] maps: reimage maps1005 as buster imposm replica [puppet] - 10https://gerrit.wikimedia.org/r/710582 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [09:25:23] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "netops: temporarily skip externallabels in alerts" [alerts] - 10https://gerrit.wikimedia.org/r/710928 (owner: 10Filippo Giunchedi) [09:25:35] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:56] !log 
hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps200[1234].codfw.wmnet [09:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:17] (03PS1) 10Filippo Giunchedi: alerts: use PosixPath.as_posix() when opening files [puppet] - 10https://gerrit.wikimedia.org/r/710931 [09:39:19] (03CR) 10Michael Große: [C: 03+1] Increase post edit constraint jobs to 85% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710925 (https://phabricator.wikimedia.org/T204031) (owner: 10Ladsgroup) [09:42:23] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: use PosixPath.as_posix() when opening files [puppet] - 10https://gerrit.wikimedia.org/r/710931 (owner: 10Filippo Giunchedi) [09:44:21] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1005.eqiad.wmnet with reason: REIMAGE [09:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:47] (03CR) 10Ladsgroup: [C: 03+2] "Deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710925 (https://phabricator.wikimedia.org/T204031) (owner: 10Ladsgroup) [09:45:31] (03Merged) 10jenkins-bot: Increase post edit constraint jobs to 85% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710925 (https://phabricator.wikimedia.org/T204031) (owner: 10Ladsgroup) [09:46:40] (03PS4) 10JMeybohm: kubernetes::node: Add node.kubernetes.io/disk-type annotation [puppet] - 10https://gerrit.wikimedia.org/r/710566 (https://phabricator.wikimedia.org/T288345) [09:46:46] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1005.eqiad.wmnet with reason: REIMAGE [09:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:17] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:49:00] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db2104 to s2 
master [puppet] - 10https://gerrit.wikimedia.org/r/710516 (https://phabricator.wikimedia.org/T287454) (owner: 10Marostegui) [09:49:16] (03CR) 10Kormat: [C: 03+1] wmnet: Update s2-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/710517 (https://phabricator.wikimedia.org/T287454) (owner: 10Marostegui) [09:49:30] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710925|Increase post edit constraint jobs to 85% of edits (T204031)]] (duration: 00m 58s) [09:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:38] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [09:50:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:22] (03PS1) 10Kormat: utils: Add support for Hosts: comments to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/710932 [09:53:41] (03PS2) 10Kormat: utils: Add support for Hosts: comments to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/710932 [09:55:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[09:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:17] (03PS2) 10Btullis: Bring an-druid1003.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710927 (https://phabricator.wikimedia.org/T255148) [09:58:02] (03PS3) 10Btullis: Bring an-druid1003.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710927 (https://phabricator.wikimedia.org/T255148) [09:59:38] (03PS1) 10Hnowlan: profile::maps: remove old postgres init script [puppet] - 10https://gerrit.wikimedia.org/r/710933 [09:59:41] (03PS1) 10Filippo Giunchedi: Revert "Revert "Fix NavtimingStaleBeacon false alarms, add test"" [alerts] - 10https://gerrit.wikimedia.org/r/710934 [10:00:37] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:01:51] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:03] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:02:06] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "Revert "Fix NavtimingStaleBeacon false alarms, add test"" [alerts] - 10https://gerrit.wikimedia.org/r/710934 (owner: 10Filippo Giunchedi) [10:07:12] (03CR) 10Btullis: [C: 03+2] Bring an-druid1003.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710927 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [10:07:54] (03PS2) 10Hnowlan: maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) [10:09:51] (03CR) 10Abijeet Patro: [C: 03+1] Review access change [software/mailman-templates] (refs/meta/config) - 
10https://gerrit.wikimedia.org/r/709976 (https://phabricator.wikimedia.org/T288027) (owner: 10Hashar) [10:10:11] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:57] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:10] (03PS1) 10Ladsgroup: Enable post edit constraint jobs in all edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710936 (https://phabricator.wikimedia.org/T204031) [10:17:19] (deploying a beta cluster config change) [10:18:24] (03PS2) 10Awight: [beta] Enable new VE template dialog sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709027 (https://phabricator.wikimedia.org/T286765) [10:18:33] (03CR) 10Awight: [C: 03+2] "Beta-only deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709027 (https://phabricator.wikimedia.org/T286765) (owner: 10Awight) [10:19:47] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:22:11] (03Merged) 10jenkins-bot: [beta] Enable new VE template dialog sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709027 (https://phabricator.wikimedia.org/T286765) (owner: 10Awight) [10:22:54] (03PS1) 10Ladsgroup: Enable shellbox constraint for commons wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710938 (https://phabricator.wikimedia.org/T176312) [10:23:05] (03CR) 10Ladsgroup: [C: 03+2] Enable post edit constraint jobs in all edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710936 (https://phabricator.wikimedia.org/T204031) (owner: 10Ladsgroup) [10:23:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:29] (03PS1) 10Giuseppe Lavagetto: deploy-mwdebug: better handle errors [puppet] - 10https://gerrit.wikimedia.org/r/710940 [10:25:38] (03PS6) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [10:25:43] (03CR) 10jerkins-bot: [V: 04-1] Enable shellbox constraint for commons wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710938 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [10:25:45] (03Merged) 10jenkins-bot: Enable post edit constraint jobs in all edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710936 (https://phabricator.wikimedia.org/T204031) (owner: 10Ladsgroup) [10:26:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:05] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710936|Enable post edit constraint jobs in all edits (T204031)]] (duration: 00m 58s) [10:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:13] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [10:27:14] (03PS2) 10Ladsgroup: Enable shellbox constraint for commons wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710938 (https://phabricator.wikimedia.org/T176312) [10:27:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:05] (03PS3) 10Ladsgroup: Enable shellbox constraint for commons wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710938 (https://phabricator.wikimedia.org/T176312) [10:29:57] Amir1: fyi I'm deploying CommonSettings-labs.php [10:30:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210809T1030). [10:30:08] awight: I rebased it :D [10:30:13] ty! [10:30:18] it is automatic deploy so that should be all [10:30:53] (03PS1) 10Ema: cache: deploy prometheus varnish exporter after varnish [puppet] - 10https://gerrit.wikimedia.org/r/710941 (https://phabricator.wikimedia.org/T283660) [10:31:11] !log awight@deploy1002 sync-file aborted: Config: [[gerrit:709027|[beta] Enable new VE template dialog sidebar (T286765)]] (duration: 00m 23s) [10:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:20] T286765: Prepare sidebar feature for deploy on beta and deploy on beta - https://phabricator.wikimedia.org/T286765 [10:31:26] harr I can never get that right. Okay, I aborted at the canary step so hopefully have not left a mess. [10:31:29] done! [10:31:36] (03PS4) 10Kormat: mariadb: Add specific role for sanitarium masters. 
[puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) [10:32:25] (03CR) 10Ladsgroup: [C: 03+2] Enable shellbox constraint for commons wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710938 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [10:33:41] (03Merged) 10jenkins-bot: Enable shellbox constraint for commons wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710938 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [10:33:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:54] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710938|Enable shellbox constraint for commons wikis (T176312)]] (duration: 00m 57s) [10:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:02] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - https://phabricator.wikimedia.org/T176312 [10:39:29] (03PS1) 10Ayounsi: discard traffic to mx2002 tcp/25 [homer/public] - 10https://gerrit.wikimedia.org/r/710943 (https://phabricator.wikimedia.org/T286911) [10:41:37] (03PS1) 10Urbanecm: dewiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710944 (https://phabricator.wikimedia.org/T288420) [10:42:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:09] (03PS2) 10Giuseppe Lavagetto: deploy-mwdebug: better handle errors [puppet] - 10https://gerrit.wikimedia.org/r/710940 [10:51:50] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "AFAIU this should do the right thing." [deployment-charts] - 10https://gerrit.wikimedia.org/r/710926 (https://phabricator.wikimedia.org/T288315) (owner: 10JMeybohm) [10:54:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin_ng: Add a new PSP for MediaWiki and allow to use it (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/710926 (https://phabricator.wikimedia.org/T288315) (owner: 10JMeybohm) [10:55:58] (03PS1) 10Kormat: db1136: Disable notifications for reimage. [puppet] - 10https://gerrit.wikimedia.org/r/710946 (https://phabricator.wikimedia.org/T288244) [10:56:00] (03PS1) 10Kormat: install_server: Switch db1136 to buster [puppet] - 10https://gerrit.wikimedia.org/r/710947 (https://phabricator.wikimedia.org/T288244) [10:57:29] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I think this is an ok compromise while we wait for writing a proper fact." [puppet] - 10https://gerrit.wikimedia.org/r/710566 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [10:58:02] (03PS6) 10Elukey: Add the Kubeflow storage initializer docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) [10:58:56] (03CR) 10Kormat: [C: 03+2] db1136: Disable notifications for reimage. 
[puppet] - 10https://gerrit.wikimedia.org/r/710946 (https://phabricator.wikimedia.org/T288244) (owner: 10Kormat) [10:58:58] (03CR) 10Elukey: Add the Kubeflow storage initializer docker image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [10:59:02] (03CR) 10Kormat: [C: 03+2] install_server: Switch db1136 to buster [puppet] - 10https://gerrit.wikimedia.org/r/710947 (https://phabricator.wikimedia.org/T288244) (owner: 10Kormat) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European mid-day backport window (your patch may or may not be deployed at the sole discretion of the deployer) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210809T1100). [11:00:04] Kizule, Lucas_WMDE, Urbanecm, and zabe: A patch you scheduled for European mid-day backport window (your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:09] \o [11:00:09] o/ [11:00:12] o/ [11:00:16] oof, many changes… [11:00:31] I can deploy today, unless Lucas_WMDE wants to :) [11:01:01] (03CR) 10Urbanecm: [C: 03+2] Disable local uploads for non-administrators on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710636 (https://phabricator.wikimedia.org/T288386) (owner: 10Zoranzoki21) [11:01:03] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:05] do you want to start with your config change?
mine’s a noop, it can wait [11:01:18] (and Kizule isn’t here yet) [11:01:25] (03PS1) 10Jelto: profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) [11:01:28] right [11:01:48] mine is a "bit" complex to get out [11:01:52] I'll start with zabe [11:01:54] (03CR) 10jerkins-bot: [V: 04-1] profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [11:01:56] (03PS3) 10Urbanecm: Add *.happysrv.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709990 (https://phabricator.wikimedia.org/T288039) (owner: 10Zabe) [11:02:11] (03CR) 10Urbanecm: [C: 03+2] Add *.happysrv.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709990 (https://phabricator.wikimedia.org/T288039) (owner: 10Zabe) [11:03:11] (03Merged) 10jenkins-bot: Add *.happysrv.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709990 (https://phabricator.wikimedia.org/T288039) (owner: 10Zabe) [11:03:25] syncing, trivial change [11:03:26] Hi Urbanecm, I'm here. [11:03:32] Sorry for lating. 
:) [11:03:34] (03PS2) 10Jelto: profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) [11:03:47] hi Kizule [11:04:41] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 54c532f4d05c6c3f8ab39d3693e481a92d1ccdf7: Add *.happysrv.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T288039) (duration: 00m 58s) [11:04:46] zabe: first one is live [11:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:50] T288039: Add happysrv.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T288039 [11:04:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 15 hosts with reason: Reimage db1136 (s7 primary) to buster T288244 [11:04:54] (03PS2) 10Urbanecm: Enable GeoData on zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710303 (https://phabricator.wikimedia.org/T287807) (owner: 10Zabe) [11:04:56] (03CR) 10Urbanecm: [C: 03+2] Enable GeoData on zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710303 (https://phabricator.wikimedia.org/T287807) (owner: 10Zabe) [11:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:00] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [11:05:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 15 hosts with reason: Reimage db1136 (s7 primary) to buster T288244 [11:05:05] (03CR) 10jerkins-bot: [V: 04-1] profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [11:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:47] (03Merged) 10jenkins-bot: Enable GeoData on zhwikinews [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/710303 (https://phabricator.wikimedia.org/T287807) (owner: 10Zabe) [11:06:47] (03CR) 10Marostegui: [C: 04-1] "Some typos" [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) (owner: 10Kormat) [11:07:16] marostegui: not typos, unexpected features! [11:07:21] zabe: your second change is at mwdebug2001, please test. [11:07:26] I know you were testing me [11:07:58] (required db table is already there, it's standard on all wikis in Wikimedia setup) [11:09:01] (03PS5) 10Kormat: mariadb: Add specific role for sanitarium masters. [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) [11:09:40] looks good to me [11:09:47] thanks, syncing [11:10:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:03] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 037aceb7f575d77930627f5062e183d514616f16: Enable GeoData on zhwikinews (T287807) (duration: 00m 57s) [11:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:09] zabe: and should be live [11:11:10] T287807: Install Extension:GeoData for zhwikinews - https://phabricator.wikimedia.org/T287807 [11:11:12] Kizule: you're next :) [11:11:20] thanks :) [11:11:21] (03PS2) 10Urbanecm: Disable local uploads for non-administrators on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710636 (https://phabricator.wikimedia.org/T288386) (owner: 10Zoranzoki21) [11:11:25] (03CR) 10Urbanecm: [C: 03+2] Disable local uploads for non-administrators on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710636 (https://phabricator.wikimedia.org/T288386) (owner: 10Zoranzoki21) [11:11:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' 
. [11:11:30] any time zabe [11:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:36] Okay urbanecm. [11:12:08] (03Merged) 10jenkins-bot: Disable local uploads for non-administrators on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710636 (https://phabricator.wikimedia.org/T288386) (owner: 10Zoranzoki21) [11:12:57] Kizule: pulled to mwdebug2001, please have a look [11:12:58] (03CR) 10Kormat: mariadb: Add specific role for sanitarium masters. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) (owner: 10Kormat) [11:13:12] urbanecm: Okay [11:13:17] * Kizule testing... [11:14:03] urbanecm: Looks good. [11:14:07] thanks, syncing [11:14:34] (03PS2) 10Urbanecm: dewiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710944 (https://phabricator.wikimedia.org/T288420) [11:14:37] (03CR) 10Urbanecm: [C: 03+2] dewiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710944 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm) [11:14:39] (03CR) 10Marostegui: [C: 03+1] mariadb: Add specific role for sanitarium masters. [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) (owner: 10Kormat) [11:15:00] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/710954 (owner: 10L10n-bot) [11:15:23] !log urbanecm@deploy1002 Synchronized dblists/commonsuploads.dblist: 9b9bb5b145fd67074c8122e0ddcba1b1e859bb78: Disable local uploads for non-administrators on nlwiki (T288386) (duration: 00m 57s) [11:15:24] (03Merged) 10jenkins-bot: dewiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710944 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm) [11:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:31] T288386: nl.wikipedia - Disable local uploads for non-administrators on nlwiki - https://phabricator.wikimedia.org/T288386 [11:15:32] Kizule: should be live. [11:15:49] !log [urbanecm@mwmaint2002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=dewiki growthexperiments # T288420 [11:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:57] T288420: Deploy Growth features on German Wikipedia - https://phabricator.wikimedia.org/T288420 [11:16:05] urbanecm: Everything is good. [11:16:09] great! [11:16:49] !log [urbanecm@mwmaint2002 ~]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=dewiki --phab=T288420 # T288420 [11:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:09] (03PS1) 10Btullis: Create the druid user and group before installing druid-common [puppet] - 10https://gerrit.wikimedia.org/r/710961 (https://phabricator.wikimedia.org/T255148) [11:19:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:05] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:06] syncing my own patch, everything works well [11:21:57] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d6564351b28d3755369736f95c36063f8b980a22: dewiki: Enable Growth features in dark mode (T288420; 1/3) (duration: 00m 57s) [11:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:05] T288420: Deploy Growth features on German Wikipedia - https://phabricator.wikimedia.org/T288420 [11:23:05] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: d6564351b28d3755369736f95c36063f8b980a22: dewiki: Enable Growth features in dark mode (T288420; 2/3) (duration: 00m 57s) [11:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:17] !log urbanecm@deploy1002 Synchronized wmf-config/config/dewiki.yaml: d6564351b28d3755369736f95c36063f8b980a22: dewiki: Enable Growth features in dark mode (T288420; 3/3) (duration: 00m 57s) [11:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:25] that should be all [11:24:27] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:29] Lucas_WMDE: all yours :) [11:24:35] \o/ [11:25:32] !log >>> \MediaWiki\MediaWikiServices::getInstance()->get('GrowthExperimentsWikiPageConfigLoader')->invalidate(Title::newFromText('MediaWiki:GrowthExperimentsConfig.json')) # dewiki shell.php; debugging Growth's wiki config [11:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop setting 
$wgWBClientSettings['repositories'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706341 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:25:53] (03PS2) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['repositories'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706341 (https://phabricator.wikimedia.org/T257260) [11:26:00] clicked the buttons in the wrong order [11:26:04] :-) [11:26:16] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706341 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:26:29] ah, now it’s in zuul [11:27:00] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1136.eqiad.wmnet with reason: REIMAGE [11:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:31] (03Merged) 10jenkins-bot: Stop setting $wgWBClientSettings['repositories'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706341 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:27:49] testing on mwdebug2001 [11:29:13] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1136.eqiad.wmnet with reason: REIMAGE [11:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:49] seems to work, syncing [11:31:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:706341|Stop setting $wgWBClientSettings['repositories'] (T257260)]] (duration: 00m 57s) [11:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:19] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260 [11:31:34] (03PS2) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseClientRepositories [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/706342 (https://phabricator.wikimedia.org/T257260) [11:31:39] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove wmgWikibaseClientRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706342 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:32:30] (03Merged) 10jenkins-bot: Remove wmgWikibaseClientRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706342 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:33:39] syncing this one directly, it’s an unused wmg [11:34:26] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:706342|Remove wmgWikibaseClientRepositories (T257260)]] (1/2, prod) (duration: 00m 57s) [11:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:09] (03PS2) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['repoNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705857 (https://phabricator.wikimedia.org/T257260) [11:35:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop setting $wgWBClientSettings['repoNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705857 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:35:35] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:706342|Remove wmgWikibaseClientRepositories (T257260)]] (2/2, beta) (duration: 00m 56s) [11:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:43] (03Merged) 10jenkins-bot: Stop setting $wgWBClientSettings['repoNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705857 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:37:07] testing on mwdebug2001 again [11:37:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:06] syncing [11:39:57] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:705857|Stop setting $wgWBClientSettings['repoNamespaces'] (T257260)]] (duration: 00m 57s) [11:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:04] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260 [11:40:11] (03PS2) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseClientRepoNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705858 (https://phabricator.wikimedia.org/T257260) [11:40:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove wmgWikibaseClientRepoNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705858 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:41:33] (03Merged) 10jenkins-bot: Remove wmgWikibaseClientRepoNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705858 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:43:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:705858|Remove wmgWikibaseClientRepoNamespaces (T257260)]] (duration: 00m 57s) [11:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:37] I think that’s it, anything else to deploy? [11:43:43] not from me [11:44:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:16] !log EU backport+config window done [11:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:27] 10SRE, 10Release-Engineering-Team: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Qgil) [11:44:47] (03CR) 10JMeybohm: Add the Kubeflow storage initializer docker image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [11:45:14] 10SRE, 10Release-Engineering-Team: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Qgil) p:05Triage→03High [11:45:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:18] (03CR) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['repoDatabase'] (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708308 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:47:03] 10SRE, 10Release-Engineering-Team: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Qgil) I took the liberty of marking this task High priority and setting a deadline for this Friday. It is not clear to me who should be assigned to this task. I a... 
[11:50:05] !log disabling puppet on all kubernetes nodes - T288345 [11:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:12] T288345: Only schedule mediawiki pods on nodes with non-spinning disks - https://phabricator.wikimedia.org/T288345 [11:50:19] (03CR) 10JMeybohm: [C: 03+2] kubernetes::node: Add node.kubernetes.io/disk-type annotation [puppet] - 10https://gerrit.wikimedia.org/r/710566 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [11:53:24] !log running puppet on kubernetes staging nodes (-b1 -s10) - T288345 [11:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:47] (03CR) 10Hoo man: [C: 03+1] Reduce ten seconds from dispatch max time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710515 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [11:58:03] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet [11:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:01] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:03] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10ArielGlenn) [12:14:33] (03Abandoned) 10Filippo Giunchedi: WIP role::cache::text testing [puppet] - 10https://gerrit.wikimedia.org/r/566246 (owner: 10Filippo Giunchedi) [12:14:52] (03Abandoned) 10Filippo Giunchedi: WIP: remove json_lines tcp [puppet] - 10https://gerrit.wikimedia.org/r/564866 (owner: 10Filippo Giunchedi) [12:14:58] (03Abandoned) 10Filippo Giunchedi: hieradata: turn down logstash tcp json_lines endpoint [puppet] - 10https://gerrit.wikimedia.org/r/565573 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [12:16:57] (03PS1) 10Btullis: Bring 
an-druid1004.equad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710963 (https://phabricator.wikimedia.org/T255148) [12:20:31] (03CR) 10Btullis: [C: 03+2] Create the druid user and group before installing druid-common [puppet] - 10https://gerrit.wikimedia.org/r/710961 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [12:20:45] (03CR) 10Btullis: [C: 03+2] Bring an-druid1004.equad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710963 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [12:27:07] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:49] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10fgiunchedi) [12:38:27] (03PS3) 10Jelto: profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) [12:38:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2128 T288398', diff saved to https://phabricator.wikimedia.org/P16973 and previous config saved to /var/cache/conftool/dbconfig/20210809-123852-marostegui.json [12:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:04] T288398: Drop two now-unused indexes on flaggedrevs table - https://phabricator.wikimedia.org/T288398 [12:39:16] (03PS2) 10Ema: cache: deploy prometheus varnish exporter after varnish [puppet] - 10https://gerrit.wikimedia.org/r/710941 (https://phabricator.wikimedia.org/T283660) [12:39:18] (03PS7) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [12:39:25] (03PS1) 10Filippo Giunchedi: hieradata: add ms-be20[62-65] [puppet] - 10https://gerrit.wikimedia.org/r/710964 (https://phabricator.wikimedia.org/T288458) [12:41:07] (03PS1) 10Giuseppe 
Lavagetto: mwdebug: only deploy to nodes with ssd disks [deployment-charts] - 10https://gerrit.wikimedia.org/r/710965 (https://phabricator.wikimedia.org/T288345) [12:41:09] (03PS1) 10Giuseppe Lavagetto: mwdebug: remove from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/710966 [12:41:11] (03PS1) 10JMeybohm: Add Kubernetes 1.17+ typolofy annotations [puppet] - 10https://gerrit.wikimedia.org/r/710967 (https://phabricator.wikimedia.org/T270191) [12:42:31] (03CR) 10jerkins-bot: [V: 04-1] mwdebug: remove from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/710966 (owner: 10Giuseppe Lavagetto) [12:42:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2128 (re)pooling @ 10%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P16974 and previous config saved to /var/cache/conftool/dbconfig/20210809-124247-root.json [12:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:59] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add ms-be20[62-65] [puppet] - 10https://gerrit.wikimedia.org/r/710964 (https://phabricator.wikimedia.org/T288458) (owner: 10Filippo Giunchedi) [12:45:06] (03PS1) 10MMandere: Traffic: Add varnish prometheus exporter alert [alerts] - 10https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660) [12:45:08] (03PS4) 10Jelto: profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) [12:46:49] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add Kubernetes 1.17+ typolofy annotations [puppet] - 10https://gerrit.wikimedia.org/r/710967 (https://phabricator.wikimedia.org/T270191) (owner: 10JMeybohm) [12:50:11] (03PS2) 10JMeybohm: Add Kubernetes 1.17+ typology annotations [puppet] - 10https://gerrit.wikimedia.org/r/710967 (https://phabricator.wikimedia.org/T270191) [12:51:19] (03PS5) 10Jelto: profile::gitlab rsync latest backup to passive host [puppet] - 
10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) [12:51:33] (03PS8) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [12:51:35] (03PS1) 10Ema: trafficserver: ensure sysconfdir exists on default instance [puppet] - 10https://gerrit.wikimedia.org/r/710969 [12:54:16] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30517/console" [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [12:57:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2128 (re)pooling @ 20%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P16975 and previous config saved to /var/cache/conftool/dbconfig/20210809-125750-root.json [12:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:03] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:31] (03CR) 10JMeybohm: [C: 03+2] Add Kubernetes 1.17+ typology annotations [puppet] - 10https://gerrit.wikimedia.org/r/710967 (https://phabricator.wikimedia.org/T270191) (owner: 10JMeybohm) [13:06:01] (03PS1) 10JMeybohm: k8s::kubelet: Fix a typo in daemon args [puppet] - 10https://gerrit.wikimedia.org/r/710970 [13:06:41] (03CR) 10JMeybohm: [C: 03+2] k8s::kubelet: Fix a typo in daemon args [puppet] - 10https://gerrit.wikimedia.org/r/710970 (owner: 10JMeybohm) [13:12:17] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:23] thats me [13:12:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2128 (re)pooling @ 40%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P16976 and previous config saved 
to /var/cache/conftool/dbconfig/20210809-131254-root.json [13:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:59] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:30] (03PS3) 10Ema: cache: deploy prometheus varnish exporter after varnish [puppet] - 10https://gerrit.wikimedia.org/r/710941 (https://phabricator.wikimedia.org/T283660) [13:14:32] (03PS2) 10Ema: trafficserver: ensure sysconfdir exists on default instance [puppet] - 10https://gerrit.wikimedia.org/r/710969 [13:14:34] (03PS9) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [13:14:36] (03PS1) 10Ema: cache: use confd only if backend list is on etcd [puppet] - 10https://gerrit.wikimedia.org/r/710973 (https://phabricator.wikimedia.org/T288106) [13:15:26] (03PS1) 10JMeybohm: Revert "Add Kubernetes 1.17+ typology annotations" [puppet] - 10https://gerrit.wikimedia.org/r/710709 [13:16:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "Add Kubernetes 1.17+ typology annotations" [puppet] - 10https://gerrit.wikimedia.org/r/710709 (owner: 10JMeybohm) [13:17:22] (03PS2) 10JMeybohm: Revert "Add Kubernetes 1.17+ typology annotations" [puppet] - 10https://gerrit.wikimedia.org/r/710709 [13:19:20] (03CR) 10JMeybohm: [C: 03+2] Revert "Add Kubernetes 1.17+ typology annotations" [puppet] - 10https://gerrit.wikimedia.org/r/710709 (owner: 10JMeybohm) [13:21:40] (03PS1) 10JMeybohm: kubernetes: Add typology annotations for kubestage200* [puppet] - 10https://gerrit.wikimedia.org/r/710974 [13:23:14] (03CR) 10JMeybohm: [C: 03+2] kubernetes: Add typology annotations for kubestage200* [puppet] - 10https://gerrit.wikimedia.org/r/710974 (owner: 10JMeybohm) [13:23:24] (03CR) 10MMandere: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/710941 (https://phabricator.wikimedia.org/T283660) (owner: 10Ema) [13:26:43] 
PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2128 (re)pooling @ 60%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P16977 and previous config saved to /var/cache/conftool/dbconfig/20210809-132758-root.json [13:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:47] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30518/console" [puppet] - 10https://gerrit.wikimedia.org/r/710941 (https://phabricator.wikimedia.org/T283660) (owner: 10Ema) [13:34:26] (03CR) 10Ema: [V: 03+1 C: 03+2] cache: deploy prometheus varnish exporter after varnish [puppet] - 10https://gerrit.wikimedia.org/r/710941 (https://phabricator.wikimedia.org/T283660) (owner: 10Ema) [13:39:39] (03PS3) 10Ema: trafficserver: ensure sysconfdir exists on default instance [puppet] - 10https://gerrit.wikimedia.org/r/710969 [13:40:45] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30519/console" [puppet] - 10https://gerrit.wikimedia.org/r/710969 (owner: 10Ema) [13:43:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2128 (re)pooling @ 80%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P16978 and previous config saved to /var/cache/conftool/dbconfig/20210809-134301-root.json [13:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:07] (03PS10) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [13:50:23] (03PS11) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [13:50:56] (03PS6) 10Kormat: mariadb: Add specific role for 
sanitarium masters. [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) [13:51:28] (03CR) 10Ema: [C: 03+2] pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 (owner: 10Ema) [13:52:06] (03Abandoned) 10Ema: pontoon: add acmechief to traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710569 (owner: 10Ema) [13:52:08] (03CR) 10JMeybohm: [C: 03+1] mwdebug: only deploy to nodes with ssd disks [deployment-charts] - 10https://gerrit.wikimedia.org/r/710965 (https://phabricator.wikimedia.org/T288345) (owner: 10Giuseppe Lavagetto) [13:52:27] (03PS2) 10Ema: cache: use confd only if backend list is on etcd [puppet] - 10https://gerrit.wikimedia.org/r/710973 (https://phabricator.wikimedia.org/T288106) [13:52:43] !log disabling puppet on all db hosts for roll-out of T285390 [13:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:51] T285390: Move sanitarium masters to dedicated puppet role - https://phabricator.wikimedia.org/T285390 [13:53:42] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30520/console" [puppet] - 10https://gerrit.wikimedia.org/r/710973 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [13:53:44] (03CR) 10Kormat: [C: 03+2] mariadb: Add specific role for sanitarium masters. 
[puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) (owner: 10Kormat) [13:57:07] jouncebot: now [13:57:07] No deployments scheduled for the next 3 hour(s) and 2 minute(s) [13:57:09] jouncebot: next [13:57:10] In 3 hour(s) and 2 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210809T1700) [13:58:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2128 (re)pooling @ 100%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P16979 and previous config saved to /var/cache/conftool/dbconfig/20210809-135805-root.json [13:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:16] (03PS1) 10Reedy: Update comment about UCoC link progress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710975 (https://phabricator.wikimedia.org/T280886) [13:59:34] (03CR) 10Reedy: [C: 03+2] Update comment about UCoC link progress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710975 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [14:00:22] (03Merged) 10jenkins-bot: Update comment about UCoC link progress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710975 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [14:02:23] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: T280886 UCoC comment update (duration: 00m 58s) [14:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:30] T280886: Add Code of Conduct link to the Universal Code of Conduct to all non technical wikis - https://phabricator.wikimedia.org/T280886 [14:02:45] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:41] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet [14:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:55] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [14:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:04] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [14:05:06] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2063.codfw.wmnet [14:05:07] (03CR) 10JMeybohm: [C: 04-1] profile::gitlab rsync latest backup to passive host (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [14:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:11] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2064.codfw.wmnet [14:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:52] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [14:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:53] !log re-enabled (and ran) puppet on all kubernetes nodes - T288345 [14:06:55] (03PS3) 10Ssingh: site: switch doh5002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/710360 [14:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:00] T288345: Only 
schedule mediawiki pods on nodes with non-spinning disks - https://phabricator.wikimedia.org/T288345 [14:07:06] (03PS3) 10Ema: cache: use confd only if backend list is on etcd [puppet] - 10https://gerrit.wikimedia.org/r/710973 (https://phabricator.wikimedia.org/T288106) [14:07:15] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:07:55] (03CR) 10Ssingh: [C: 03+2] site: switch doh5002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/710360 (owner: 10Ssingh) [14:08:55] (03CR) 10Ema: [C: 03+2] cache: use confd only if backend list is on etcd [puppet] - 10https://gerrit.wikimedia.org/r/710973 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [14:09:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps100[1234].eqiad.wmnet [14:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:43] (03PS1) 10Kormat: cumin: Update db aliases [puppet] - 10https://gerrit.wikimedia.org/r/710976 [14:10:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet [14:10:36] (03CR) 10Ssingh: [C: 03+2] Add doh5002 to BGP anycast in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/710358 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [14:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:20] (03Merged) 10jenkins-bot: Add doh5002 to BGP anycast in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/710358 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [14:13:30] (03PS4) 10Ema: trafficserver: ensure sysconfdir exists on default instance [puppet] - 10https://gerrit.wikimedia.org/r/710969 [14:17:32] (03CR) 10Kormat: [C: 03+2] cumin: Update db aliases [puppet] - 10https://gerrit.wikimedia.org/r/710976 (owner: 10Kormat) [14:17:38] !log ran homer for Gerrit 
710358: Set up BGP peering to doh5002 in eqsin [14:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:23] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:35] (03CR) 10Jcrespo: [C: 03+1] cumin: Update db aliases [puppet] - 10https://gerrit.wikimedia.org/r/710976 (owner: 10Kormat) [14:23:42] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:28] (03PS1) 10Kormat: Revert "db1136: Disable notifications for reimage." [puppet] - 10https://gerrit.wikimedia.org/r/710710 [14:39:43] (03CR) 10Kormat: [C: 03+2] Revert "db1136: Disable notifications for reimage." [puppet] - 10https://gerrit.wikimedia.org/r/710710 (owner: 10Kormat) [14:40:04] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:05] (03CR) 10Ema: [C: 03+1] "One nit, lgtm otherwise!" 
[alerts] - 10https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660) (owner: 10MMandere) [14:44:57] (03PS6) 10Jelto: profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) [14:49:15] (03PS1) 10Jcrespo: dbbackups: Switch s7 backups from stretch to buster (db1171) [puppet] - 10https://gerrit.wikimedia.org/r/710977 (https://phabricator.wikimedia.org/T288244) [14:50:26] (03PS2) 10Jcrespo: dbbackups: Switch s7 backups from stretch (db1116) to buster (db1171) [puppet] - 10https://gerrit.wikimedia.org/r/710977 (https://phabricator.wikimedia.org/T288244) [14:51:06] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30521/console" [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [14:53:09] (03CR) 10Kormat: [C: 03+1] dbbackups: Switch s7 backups from stretch (db1116) to buster (db1171) [puppet] - 10https://gerrit.wikimedia.org/r/710977 (https://phabricator.wikimedia.org/T288244) (owner: 10Jcrespo) [14:54:19] (03PS5) 10Hnowlan: postgresql::user: split HBA configuration into a different define [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) [14:55:06] (03PS3) 10Jcrespo: dbbackups: Switch s7 backups from stretch (db1116) to buster (db1171) [puppet] - 10https://gerrit.wikimedia.org/r/710977 (https://phabricator.wikimedia.org/T288244) [14:55:44] (03CR) 10Hnowlan: [V: 03+1] postgresql::user: split HBA configuration into a different define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [14:56:42] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Switch s7 backups from stretch (db1116) to buster (db1171) [puppet] - 10https://gerrit.wikimedia.org/r/710977 (https://phabricator.wikimedia.org/T288244) (owner: 10Jcrespo) 
[14:58:37] (03PS1) 10Btullis: Bring an-druid1005.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710979 (https://phabricator.wikimedia.org/T255148) [14:59:51] (03PS1) 10Btullis: Add dummy keytabs for new druid nodes [labs/private] - 10https://gerrit.wikimedia.org/r/710980 (https://phabricator.wikimedia.org/T255148) [15:01:44] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:09] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy keytabs for new druid nodes [labs/private] - 10https://gerrit.wikimedia.org/r/710980 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [15:05:18] (03PS1) 10Jcrespo: dbbackups: Switch s7 backups from stretch (db2100) to buster (db2098) [puppet] - 10https://gerrit.wikimedia.org/r/710981 (https://phabricator.wikimedia.org/T288244) [15:07:13] (03CR) 10Jcrespo: [C: 04-1] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/710981 (https://phabricator.wikimedia.org/T288244) (owner: 10Jcrespo) [15:13:38] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The following units failed: apt-daily-upgrade.service,apt-daily.service,elasticsearch-disable-readahead.service,elasticsearch_6@production-search-eqiad.service,export_smart_data_dump.service,prometheus-debian-version-textfile.service,prometheus-node-exporter-apt.service,prometheus-node-exporter.service,prometheus_intel_microcode.service,prometheus_puppet_agent_ [15:13:38] rvice,systemd-timesyncd.service,wmf_auto_restart_prometheus-node-exporter.service,wmf_auto_restart_systemd-timesyncd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:02] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:22:11] (03CR) 10Kormat: [C: 03+1] dbbackups: Switch s7 backups from stretch (db2100) to 
buster (db2098) [puppet] - 10https://gerrit.wikimedia.org/r/710981 (https://phabricator.wikimedia.org/T288244) (owner: 10Jcrespo) [15:27:10] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:35] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2062.codfw.wmnet [15:33:37] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2064.codfw.wmnet [15:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:05] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2065.codfw.wmnet [15:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:40] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM." 
[homer/public] - 10https://gerrit.wikimedia.org/r/710943 (https://phabricator.wikimedia.org/T286911) (owner: 10Ayounsi) [15:39:06] PROBLEM - Host ms-be2062 is DOWN: PING CRITICAL - Packet loss = 100% [15:39:16] RECOVERY - Host ms-be2062 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [15:41:46] PROBLEM - Host ms-be2064 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:06] RECOVERY - Host ms-be2064 is UP: PING OK - Packet loss = 0%, RTA = 34.27 ms [15:42:23] that's me ^ [15:45:12] PROBLEM - Host ms-be2062 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:12] RECOVERY - Host ms-be2062 is UP: PING OK - Packet loss = 0%, RTA = 31.85 ms [15:47:25] (03PS1) 10Hnowlan: maps: disable cassandra metrics collector [puppet] - 10https://gerrit.wikimedia.org/r/710984 (https://phabricator.wikimedia.org/T186567) [15:49:24] (03CR) 10Legoktm: [C: 03+1] Reduce ten seconds from dispatch max time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710515 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [15:49:48] (03CR) 10Legoktm: [C: 03+2] sre.switchdc.services: Exclude helm-charts, lacking a service IP [cookbooks] - 10https://gerrit.wikimedia.org/r/710235 (https://phabricator.wikimedia.org/T285707) (owner: 10Legoktm) [15:51:20] PROBLEM - Check systemd state on maps1004 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-metrics-collector.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:58] (03CR) 10Legoktm: [C: 03+2] shellbox: Add new logo, by thcipriani [deployment-charts] - 10https://gerrit.wikimedia.org/r/710597 (owner: 10Legoktm) [15:53:02] (03CR) 10Legoktm: [C: 03+2] shellbox: Disable php-fpm slowlog [deployment-charts] - 10https://gerrit.wikimedia.org/r/710607 (https://phabricator.wikimedia.org/T288315) (owner: 10Legoktm) [15:53:04] (03CR) 10jerkins-bot: [V: 04-1] sre.switchdc.services: Exclude helm-charts, lacking a service IP [cookbooks] - 10https://gerrit.wikimedia.org/r/710235 
(https://phabricator.wikimedia.org/T285707) (owner: 10Legoktm) [15:53:06] RECOVERY - Check systemd state on maps1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:52] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10fgiunchedi) [15:55:28] (03Merged) 10jenkins-bot: shellbox: Add new logo, by thcipriani [deployment-charts] - 10https://gerrit.wikimedia.org/r/710597 (owner: 10Legoktm) [15:55:30] (03Merged) 10jenkins-bot: shellbox: Disable php-fpm slowlog [deployment-charts] - 10https://gerrit.wikimedia.org/r/710607 (https://phabricator.wikimedia.org/T288315) (owner: 10Legoktm) [15:57:51] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . [15:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:56] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Add a new PSP for MediaWiki and allow to use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/710926 (https://phabricator.wikimedia.org/T288315) (owner: 10JMeybohm) [16:00:10] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:00:14] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:15] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox' for release 'main' . 
[16:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:27] (03Merged) 10jenkins-bot: admin_ng: Add a new PSP for MediaWiki and allow to use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/710926 (https://phabricator.wikimedia.org/T288315) (owner: 10JMeybohm) [16:02:02] (03CR) 10Jelto: [V: 03+1] "@JMeybohm thanks for the review! Could you take a look again? I removed the unused variables and made sure that all resources have the rig" [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [16:02:08] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:30] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:47] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox' for release 'main' . [16:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:03] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:27] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:50] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:04] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . 
 [16:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:39] great, thanks for testing my stuff legoktm :-p [16:04:49] haha [16:05:18] legoktm: are you planning to deploy to prod as well? [16:05:33] yes [16:05:35] is that OK? [16:05:48] would be nice if you could follow me/my changes then [16:05:57] ok, I'm in no rush [16:06:27] just to make sure I did not break anything... did the deploy to staging pass? [16:06:46] yep [16:06:50] cool! [16:07:01] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:23] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:38] feel free to deploy to eqiad then :) [16:07:57] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [16:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:15] I like how my terminal breaks the ps output as "hellbox-constraints --namespace..." :D [16:09:25] heh [16:09:30] deployed to eqiad, looks good [16:09:42] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:56] (03PS1) 10Hnowlan: cassandra: remove cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/710985 (https://phabricator.wikimedia.org/T186567) [16:10:34] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:38] legoktm: great. I'm done then [16:11:58] cool, deploying to codfw now [16:12:05] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . 
[16:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:24] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:19] thanks! [16:23:31] (03PS1) 10JMeybohm: admin_ng: Switch mwdebug namespace to allow-mediawiki-psp [deployment-charts] - 10https://gerrit.wikimedia.org/r/710986 (https://phabricator.wikimedia.org/T288315) [16:23:33] (03PS1) 10JMeybohm: mediawiki: Add the SYS_PTRACE capability to the php container [deployment-charts] - 10https://gerrit.wikimedia.org/r/710987 (https://phabricator.wikimedia.org/T288315) [16:27:02] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:18] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30525/console" [puppet] - 10https://gerrit.wikimedia.org/r/710985 (https://phabricator.wikimedia.org/T186567) (owner: 10Hnowlan) [16:40:24] (03CR) 10JMeybohm: [C: 04-1] profile::gitlab rsync latest backup to passive host (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [17:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210809T1700). 
[17:01:18] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:12] (03CR) 10Btullis: [C: 03+2] Bring an-druid1005.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710979 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [17:09:09] (03PS2) 10Btullis: Bring an-druid1005.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710979 (https://phabricator.wikimedia.org/T255148) [17:09:50] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:10:15] (03CR) 10Btullis: [C: 03+2] Bring an-druid1005.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/710979 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [17:21:48] (03PS2) 10BryanDavis: toolhub: Add CronJob for crawler [deployment-charts] - 10https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) [17:23:20] (03CR) 10BryanDavis: toolhub: Add CronJob for crawler (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: 10BryanDavis) [17:23:42] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:41] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10thcipriani) hi @Qgil Bitergia has a lot of the info you're looking for > or have made at least one merged commit to any Wikimedia repos on Gerrit,... 
[17:36:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: only deploy to nodes with ssd disks [deployment-charts] - 10https://gerrit.wikimedia.org/r/710965 (https://phabricator.wikimedia.org/T288345) (owner: 10Giuseppe Lavagetto) [17:37:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deploy-mwdebug: better handle errors [puppet] - 10https://gerrit.wikimedia.org/r/710940 (owner: 10Giuseppe Lavagetto) [17:39:26] (03CR) 10BryanDavis: toolhub: initial chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [17:39:33] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10RLazarus) Likewise for "Wikimedia server administrators with shell access" we can pull this out of [[ https://gerrit.wikimedia.org/r/plugins/gitiles... [17:39:33] (03Merged) 10jenkins-bot: mwdebug: only deploy to nodes with ssd disks [deployment-charts] - 10https://gerrit.wikimedia.org/r/710965 (https://phabricator.wikimedia.org/T288345) (owner: 10Giuseppe Lavagetto) [17:44:56] (03PS1) 10Giuseppe Lavagetto: mwdebug: also get discovery listeners from puppet [puppet] - 10https://gerrit.wikimedia.org/r/710999 [17:45:24] (03CR) 10jerkins-bot: [V: 04-1] mwdebug: also get discovery listeners from puppet [puppet] - 10https://gerrit.wikimedia.org/r/710999 (owner: 10Giuseppe Lavagetto) [17:45:29] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mwdebug: also get discovery listeners from puppet [puppet] - 10https://gerrit.wikimedia.org/r/710999 (owner: 10Giuseppe Lavagetto) [17:47:07] (03PS2) 10Giuseppe Lavagetto: mwdebug: also get discovery listeners from puppet [puppet] - 10https://gerrit.wikimedia.org/r/710999 [17:48:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: also get discovery listeners from puppet [puppet] - 10https://gerrit.wikimedia.org/r/710999 (owner: 10Giuseppe Lavagetto) [18:00:05] RoanKattouw, 
Niharika, and Urbanecm: May I have your attention please! Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210809T1800) [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:01:52] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:05] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Tgr) >>! In T288455#7270569, @thcipriani wrote: > hi @Qgil Bitergia has a lot of the info you're looking for > >> or have made at least one merged... [18:04:48] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Tgr) [18:26:18] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:02] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:24] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:26:36] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210809T2000). Please do the needful. 
[20:02:20] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:06] 10SRE, 10Traffic, 10serviceops: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10aborrero) [20:23:09] (03PS1) 10Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) [20:26:44] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:26] 10SRE, 10Traffic, 10serviceops: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10aborrero) [21:00:04] Reedy and sbassett: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210809T2100). [21:02:22] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:26:20] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:42:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:02:08] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:14:54] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:25:29] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list - wikimediadrc-kinshasa - https://phabricator.wikimedia.org/T288410 (10Legoktm) @CapitainAfrika is there a wiki page that explains your group somewhere? [22:26:12] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:46] PROBLEM - SSH on mw1305.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:00:05] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210809T2300). Please do the needful. [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:01:16] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:25:20] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state