[00:01:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P44324 and previous config saved to /var/cache/conftool/dbconfig/20230211-000131-marostegui.json
[00:05:05] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: man-db.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:24] SRE, SRE-OnFire, ops-codfw, Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (Papaul) Dear Juniper Networks Customer, Thank you for returning your defective product in relation to your recently created RMA. This notification confirms that Juniper...
[00:12:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2451.mgmt.codfw.wmnet with reboot policy FORCED
[00:12:38] SRE, SRE-Access-Requests: Requesting access to analytics-private-data for Fgoodwin - https://phabricator.wikimedia.org/T329404 (thcipriani)
[00:13:11] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[00:16:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T329203)', diff saved to https://phabricator.wikimedia.org/P44325 and previous config saved to /var/cache/conftool/dbconfig/20230211-001637-marostegui.json
[00:16:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[00:16:41] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[00:16:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[00:16:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T329203)', diff saved to https://phabricator.wikimedia.org/P44326 and previous config saved to /var/cache/conftool/dbconfig/20230211-001658-marostegui.json
[00:17:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2436.mgmt.codfw.wmnet with reboot policy FORCED
[00:17:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2437.mgmt.codfw.wmnet with reboot policy FORCED
[00:17:57] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2436.mgmt.codfw.wmnet with reboot policy FORCED
[00:18:00] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2437.mgmt.codfw.wmnet with reboot policy FORCED
[00:18:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2436.mgmt.codfw.wmnet with reboot policy FORCED
[00:18:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2437.mgmt.codfw.wmnet with reboot policy FORCED
[00:19:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T329203)', diff saved to https://phabricator.wikimedia.org/P44327 and previous config saved to /var/cache/conftool/dbconfig/20230211-001914-marostegui.json
[00:25:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2436.mgmt.codfw.wmnet with reboot policy FORCED
[00:25:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2437.mgmt.codfw.wmnet with reboot policy FORCED
[00:27:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2439.mgmt.codfw.wmnet with reboot policy FORCED
[00:27:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2440.mgmt.codfw.wmnet with reboot policy FORCED
[00:30:23] (CR) Dzahn: [V: -1] "parameter 'unit' expects a match for Systemd::Servicename = Pattern[/^[a-zA-Z0-9@:_.\\-]{1,248}\.service$/], got 'phd'" [puppet] - https://gerrit.wikimedia.org/r/888274 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[00:31:01] (CR) Dzahn: [V: -1] ""mask" and "unmask" have different expectations about the service name?" [puppet] - https://gerrit.wikimedia.org/r/888274 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[00:32:22] (PS2) Dzahn: phabricator: stop/disable/mask phd based on phabricator_server setting [puppet] - https://gerrit.wikimedia.org/r/888274 (https://phabricator.wikimedia.org/T329285)
[00:34:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P44328 and previous config saved to /var/cache/conftool/dbconfig/20230211-003420-marostegui.json
[00:34:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2439.mgmt.codfw.wmnet with reboot policy FORCED
[00:34:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2440.mgmt.codfw.wmnet with reboot policy FORCED
[00:35:11] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: run-dashboards-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2441.mgmt.codfw.wmnet with reboot policy FORCED
[00:35:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2442.mgmt.codfw.wmnet with reboot policy FORCED
[00:43:19] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/888274/39510/" [puppet] - https://gerrit.wikimedia.org/r/888274 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[00:44:04] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:44:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2441.mgmt.codfw.wmnet with reboot policy FORCED
[00:44:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2442.mgmt.codfw.wmnet with reboot policy FORCED
[00:46:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2443.mgmt.codfw.wmnet with reboot policy FORCED
[00:46:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2444.mgmt.codfw.wmnet with reboot policy FORCED
[00:49:04] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:49:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P44329 and previous config saved to /var/cache/conftool/dbconfig/20230211-004927-marostegui.json
[01:00:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2443.mgmt.codfw.wmnet with reboot policy FORCED
[01:00:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2444.mgmt.codfw.wmnet with reboot policy FORCED
[01:00:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2445.mgmt.codfw.wmnet with reboot policy FORCED
[01:01:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2446.mgmt.codfw.wmnet with reboot policy FORCED
[01:04:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T329203)', diff saved to https://phabricator.wikimedia.org/P44330 and previous config saved to /var/cache/conftool/dbconfig/20230211-010433-marostegui.json
[01:04:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[01:04:38] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[01:04:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[01:04:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T329203)', diff saved to https://phabricator.wikimedia.org/P44331 and previous config saved to /var/cache/conftool/dbconfig/20230211-010454-marostegui.json
[01:08:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2445.mgmt.codfw.wmnet with reboot policy FORCED
[01:08:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2446.mgmt.codfw.wmnet with reboot policy FORCED
[01:10:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T329203)', diff saved to https://phabricator.wikimedia.org/P44332 and previous config saved to /var/cache/conftool/dbconfig/20230211-011010-marostegui.json
[01:10:15] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[01:10:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2447.mgmt.codfw.wmnet with reboot policy FORCED
[01:10:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2448.mgmt.codfw.wmnet with reboot policy FORCED
[01:25:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P44333 and previous config saved to /var/cache/conftool/dbconfig/20230211-012517-marostegui.json
[01:27:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2447.mgmt.codfw.wmnet with reboot policy FORCED
[01:27:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2448.mgmt.codfw.wmnet with reboot policy FORCED
[01:28:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2449.mgmt.codfw.wmnet with reboot policy FORCED
[01:28:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2450.mgmt.codfw.wmnet with reboot policy FORCED
[01:32:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2450.mgmt.codfw.wmnet with reboot policy FORCED
[01:37:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2449.mgmt.codfw.wmnet with reboot policy FORCED
[01:37:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2451.mgmt.codfw.wmnet with reboot policy FORCED
[01:40:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P44334 and previous config saved to /var/cache/conftool/dbconfig/20230211-014023-marostegui.json
[01:41:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2451.mgmt.codfw.wmnet with reboot policy FORCED
[01:42:06] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[01:55:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T329203)', diff saved to https://phabricator.wikimedia.org/P44335 and previous config saved to /var/cache/conftool/dbconfig/20230211-015530-marostegui.json
[01:55:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[01:55:34] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[01:55:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[02:10:46] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:46] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:31] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:47] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:16:39] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:16:55] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:46:35] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:12:34] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[06:13:21] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:04] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:34:04] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:38:47] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:44:13] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 104, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:14:39] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:21:45] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 105, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:41:29] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:48:49] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 104, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:08:23] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:15:29] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 104, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:33:41] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:40:59] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 104, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:58:53] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:07:53] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 104, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:00:09] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:33] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:50:02] SRE, serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Physikerwelt)
[18:35:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:40:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:54:15] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 2102 MB (3% inode=97%): /srv/swift-storage/sda3 10502 MB (5% inode=99%): /tmp 2102 MB (3% inode=97%): /var/tmp 2102 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[20:10:59] SRE-swift-storage, Wikimedia-production-error: 503, Backend fetch failed while undeleting files - https://phabricator.wikimedia.org/T328579 (Aklapper)
[20:26:19] disk space issues on Thanos? the fix seems...quite straightforward ;)
[21:46:32] (PS1) Majavah: Drop Tomcat support [software/tools-webservice] - https://gerrit.wikimedia.org/r/888296 (https://phabricator.wikimedia.org/T141396)
[21:47:12] (CR) CI reject: [V: -1] Drop Tomcat support [software/tools-webservice] - https://gerrit.wikimedia.org/r/888296 (https://phabricator.wikimedia.org/T141396) (owner: Majavah)
[21:48:16] (PS2) Majavah: Drop Tomcat support [software/tools-webservice] - https://gerrit.wikimedia.org/r/888296 (https://phabricator.wikimedia.org/T141396)
[22:22:39] Puppet, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review: role::puppetmaster::standalone clones Git repositories as gitpuppet, git-sync-upstream overwrites them as root - https://phabricator.wikimedia.org/T152059 (taavi)