[00:03:20] ops-codfw, DC-Ops, decommission-hardware, fundraising-tech-ops: Decommission frack hosts: civi2001 - https://phabricator.wikimedia.org/T397380#10952576 (Dwisehaupt)
[00:04:22] SRE, fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#10952580 (Dwisehaupt) a:Dwisehaupt→None
[00:08:31] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:35] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1164322
[00:08:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1164322 (owner: TrainBranchBot)
[00:09:21] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, Patch-For-Review: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10952585 (Dwisehaupt) Open→Resolved @Jgreen Was able to bring the bond interface up on pay-lb1002 and it has survi...
[00:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[00:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:54:52] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397983#10952612 (phaultfinder)
[00:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:01:01] (CR) Andrew Bogott: [C:+2] Openstack glance: switch from novaadmin to 'glance' service user [puppet] - https://gerrit.wikimedia.org/r/1164303 (https://phabricator.wikimedia.org/T273150) (owner: Andrew Bogott)
[01:10:02] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1164322 (owner: TrainBranchBot)
[01:27:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:29:21] (PS3) JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - https://gerrit.wikimedia.org/r/1164316
[02:36:23] (PS4) JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - https://gerrit.wikimedia.org/r/1164316
[02:45:27] (CR) CI reject: [V:-1] dhcp: add a UUID based DHCP config [software/spicerack] - https://gerrit.wikimedia.org/r/1164316 (owner: JHathaway)
[03:06:20] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3631 MB (3% inode=98%): /tmp 3631 MB (3% inode=98%): /var/tmp 3631 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[03:11:09] (PS5) JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - https://gerrit.wikimedia.org/r/1164316
[03:20:40] (CR) CI reject: [V:-1] dhcp: add a UUID based DHCP config [software/spicerack] - https://gerrit.wikimedia.org/r/1164316 (owner: JHathaway)
[03:21:51] (PS6) JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - https://gerrit.wikimedia.org/r/1164316
[03:28:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:30:46] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/1164186 (owner: L10n-bot)
[03:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[03:46:20] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3558 MB (3% inode=98%): /tmp 3558 MB (3% inode=98%): /var/tmp 3558 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[04:08:31] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:26:20] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3492 MB (3% inode=98%): /tmp 3492 MB (3% inode=98%): /var/tmp 3492 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[04:57:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:00:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:05:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:07:41] (PS1) Ilias Sarantopoulos: amd-pytorch21: delete torch 2.1.2 + ROCm 5.6 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1164329
[05:12:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:17:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:18:33] (CR) Giuseppe Lavagetto: [C:-1] "Please fix the commit message, for the sake of our future selves." [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: Fabfur)
[05:27:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:27:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:28:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:32:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:33:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:45:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:50:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250627T0600)
[06:00:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:10:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:11:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:15:57] ops-eqiad, DC-Ops: Outbound errors on interface cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://phabricator.wikimedia.org/T398006 (phaultfinder) NEW
[06:17:03] (CR) Kevin Bazira: [C:+1] "LGTM!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1164329 (owner: Ilias Sarantopoulos)
[06:30:30] SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10952844 (jcrespo) I leave you with some homework meanwhile: T387833#10952842
[06:31:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:35:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:38:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:43:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:46:04] (PS1) Muehlenhoff: Allow passing multiple debmonitor servers in Hiera [puppet] - https://gerrit.wikimedia.org/r/1164333 (https://phabricator.wikimedia.org/T397696)
[06:46:30] (CR) CI reject: [V:-1] Allow passing multiple debmonitor servers in Hiera [puppet] - https://gerrit.wikimedia.org/r/1164333 (https://phabricator.wikimedia.org/T397696) (owner: Muehlenhoff)
[06:46:38] (CR) Filippo Giunchedi: [C:+2] icinga: decommission frack hosts [puppet] - https://gerrit.wikimedia.org/r/1163851 (https://phabricator.wikimedia.org/T397868) (owner: Dwisehaupt)
[06:47:30] (CR) Brouberol: "@btullis@wikimedia.org do we need to reflect this change in Kubernetes, or is that server-side only?" [puppet] - https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) (owner: Btullis)
[06:48:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:51:28] (PS2) Muehlenhoff: Allow passing multiple debmonitor servers in Hiera [puppet] - https://gerrit.wikimedia.org/r/1164333 (https://phabricator.wikimedia.org/T397696)
[06:56:06] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164333 (https://phabricator.wikimedia.org/T397696) (owner: Muehlenhoff)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250627T0700)
[07:06:28] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 195759080 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:07:30] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2342960 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:10:21] (PS3) Muehlenhoff: Allow passing multiple debmonitor servers in Hiera [puppet] - https://gerrit.wikimedia.org/r/1164333 (https://phabricator.wikimedia.org/T397696)
[07:11:30] ops-codfw, SRE, DC-Ops, decommission-hardware, and 2 others: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002 - https://phabricator.wikimedia.org/T397868#10952973 (Volans) When editing netbox DNS records please always make sure to run the `sre.dns.netbox` cookbook as otherwise there ar...
[07:13:30] ops-codfw, SRE, DC-Ops, decommission-hardware, and 2 others: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002 - https://phabricator.wikimedia.org/T397868#10952976 (Volans) There are currently changes to remove: ` frpig2001.mgmt.frack.codfw.wmnet pay-lvs2001.mgmt.frack.codfw.wmnet pay-l...
[07:15:39] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Update
[07:15:44] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164333 (https://phabricator.wikimedia.org/T397696) (owner: Muehlenhoff)
[07:19:38] (PS2) Fabfur: cache,haproxy: use http-after-response capture for x-analytics [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917)
[07:20:20] (CR) Fabfur: cache,haproxy: use http-after-response capture for x-analytics (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: Fabfur)
[07:22:30] (CR) Volans: "Suggested simplification inline" [puppet] - https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[07:23:42] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1164333 (https://phabricator.wikimedia.org/T397696) (owner: Muehlenhoff)
[07:25:00] SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10953025 (Jelto) Thanks for the update @jcrespo ! T387833#10952842 should be unrelated to the efforts of migrating GitLab to obj...
[07:25:10] (PS1) Stevemunene: zookeeper: decommission an-conf100[1-3] [puppet] - https://gerrit.wikimedia.org/r/1164337 (https://phabricator.wikimedia.org/T398013)
[07:26:19] !log depool cp7007 for testing (T397917)
[07:26:23] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet
[07:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:24] T397917: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917
[07:26:54] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (dbprov1006), No backups: 7 (dbprov1003, ...), Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:29:03] (PS3) Fabfur: cache,haproxy: use http-after-response capture for x-analytics [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917)
[07:29:06] (PS1) Stevemunene: replace decommissioned an-conf host [alerts] - https://gerrit.wikimedia.org/r/1164338 (https://phabricator.wikimedia.org/T398013)
[07:32:30] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 492224248 and 29 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:34:07] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet
[07:34:30] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 84664 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:34:44] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet
[07:35:30] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:38:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:53] !log deploying debmonitor-client v0.5.0 fleet-wide
[07:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:46:20] (PS1) Phuedx: [analytics][refine]: Stop refining TwoColConflict* legacy EventLogging streams [puppet] - https://gerrit.wikimedia.org/r/1164356
[07:46:58] (PS2) Phuedx: [analytics][refine]: Stop refining TwoColConflict* legacy EventLogging streams [puppet] - https://gerrit.wikimedia.org/r/1164356 (https://phabricator.wikimedia.org/T397611)
[07:49:45] (CR) Thiemo Kreuz (WMDE): [C:+1] [analytics][refine]: Stop refining TwoColConflict* legacy EventLogging streams [puppet] - https://gerrit.wikimedia.org/r/1164356 (https://phabricator.wikimedia.org/T397611) (owner: Phuedx)
[07:58:54] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:59:00] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[07:59:33] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:00:44] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:02:17] (PS1) Phuedx: ext-EventStreamConfig: Remove eventlogging_TwoColConflict* streams [mediawiki-config] - https://gerrit.wikimedia.org/r/1164388 (https://phabricator.wikimedia.org/T397611)
[08:02:50] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[08:03:31] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:04:20] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164388 (https://phabricator.wikimedia.org/T397611) (owner: Phuedx)
[08:04:33] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:06:50] (CR) Brouberol: [C:+1] replace decommissioned an-conf host [alerts] - https://gerrit.wikimedia.org/r/1164338 (https://phabricator.wikimedia.org/T398013) (owner: Stevemunene)
[08:06:58] (CR) Brouberol: [C:+1] zookeeper: decommission an-conf100[1-3] [puppet] - https://gerrit.wikimedia.org/r/1164337 (https://phabricator.wikimedia.org/T398013) (owner: Stevemunene)
[08:07:14] (CR) Brouberol: [C:+1] hdfs: Assign the right role to new hadoop workers [puppet] - https://gerrit.wikimedia.org/r/1163778 (https://phabricator.wikimedia.org/T397615) (owner: Stevemunene)
[08:08:09] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet
[08:08:20] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet
[08:10:38] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet
[08:10:40] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet
[08:10:49] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet
[08:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:17:17] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet
[08:17:20] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet
[08:17:37] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet
[08:19:27] !log stevemunene@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-conf1001.eqiad.wmnet
[08:19:40] (PS1) Jelto: gitlab: disable nftables prometheus exporter script in wmcs [puppet] - https://gerrit.wikimedia.org/r/1164389 (https://phabricator.wikimedia.org/T396622)
[08:19:57] (CR) Stevemunene: [C:+2] zookeeper: decommission an-conf100[1-3] [puppet] - https://gerrit.wikimedia.org/r/1164337 (https://phabricator.wikimedia.org/T398013) (owner: Stevemunene)
[08:23:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:25:20] (PS1) Gerrit maintenance bot: mariadb: Promote db2212 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1164390 (https://phabricator.wikimedia.org/T398014)
[08:25:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:26:05] !log stevemunene@cumin1003 START - Cookbook sre.dns.netbox
[08:28:20] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet
[08:28:40] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet
[08:30:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:31:38] stevemunene@cumin1003 decommission (PID 3428784) is awaiting input
[08:31:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:33:53] (CR) Jelto: [V:+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6094/console" [puppet] - https://gerrit.wikimedia.org/r/1164389 (https://phabricator.wikimedia.org/T396622) (owner: Jelto)
[08:34:32] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet
[08:34:48] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet
[08:36:38] (CR) Jelto: [V:+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6095/console" [puppet] - https://gerrit.wikimedia.org/r/1164389 (https://phabricator.wikimedia.org/T396622) (owner: Jelto)
[08:36:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:37:24] ops-codfw, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002 - https://phabricator.wikimedia.org/T397868#10953179 (Stevemunene) >>! In T397868#10952976, @Volans wrote: > There are currently changes to remove: > ` > frpig2001....
[08:38:32] (PS1) Muehlenhoff: Fix typo in LDAP record [puppet] - https://gerrit.wikimedia.org/r/1164392
[08:38:50] (CR) Bartosz Wójtowicz: "Leaving a small comment from my side, I'm happy to approve once we resolve it! 😊" [deployment-charts] - https://gerrit.wikimedia.org/r/1164271 (https://phabricator.wikimedia.org/T397013) (owner: AikoChou)
[08:38:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:40:52] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: AOkoth)
[08:41:17] (CR) Btullis: [C:+1] replace decommissioned an-conf host [alerts] - https://gerrit.wikimedia.org/r/1164338 (https://phabricator.wikimedia.org/T398013) (owner: Stevemunene)
[08:42:16] (CR) Stevemunene: [C:+2] replace decommissioned an-conf host [alerts] - https://gerrit.wikimedia.org/r/1164338 (https://phabricator.wikimedia.org/T398013) (owner: Stevemunene)
[08:42:17] (CR) Btullis: [C:+1] hdfs: Assign the right role to new hadoop workers [puppet] - https://gerrit.wikimedia.org/r/1163778 (https://phabricator.wikimedia.org/T397615) (owner: Stevemunene)
[08:43:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:43:58] (Merged) jenkins-bot: replace decommissioned an-conf host [alerts] - https://gerrit.wikimedia.org/r/1164338 (https://phabricator.wikimedia.org/T398013) (owner: Stevemunene)
[08:44:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:48:12] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6096/co" [puppet] - https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[08:48:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:49:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:52:03] (CR) Jelto: [V:+1 C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[08:54:30] (CR) Muehlenhoff: [C:+2] Fix typo in LDAP record [puppet] - https://gerrit.wikimedia.org/r/1164392 (owner: Muehlenhoff)
[08:56:32] !log Publish new version of Add Link datasets for enwiki (T386867)
[08:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:39] T386867: Add a Link: add "do not link" rule for country names (Q6256) on English Wikipedia - https://phabricator.wikimedia.org/T386867
[08:58:01] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[09:01:37] (PS8) Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892)
[09:08:19] (PS1) Elukey: profile::docker::reporter: exclude /repos/releng/zuul/zuul/nodepool-launcher [puppet] - https://gerrit.wikimedia.org/r/1164393
[09:10:32] (PS8) Volans: WIP: add support for kubernetes [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1163826 (owner: Elukey)
[09:11:21] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1164393 (owner: Elukey)
[09:16:47] (CR) Muehlenhoff: [C:+2] Allow passing multiple debmonitor servers in Hiera [puppet] - https://gerrit.wikimedia.org/r/1164333 (https://phabricator.wikimedia.org/T397696) (owner: Muehlenhoff)
[09:16:57] SRE, collaboration-services, Wikimedia-Mailing-lists: Follow up on lists.wm.o TLS usage - https://phabricator.wikimedia.org/T398018 (Vgutierrez) NEW
[09:17:46] SRE, collaboration-services, Wikimedia-Mailing-lists: Follow up on lists.wm.o TLS usage - https://phabricator.wikimedia.org/T398018#10953279 (Vgutierrez) p:Triage→Medium a:Vgutierrez→None
[09:18:44] (CR) Elukey: [C:+2] profile::docker::reporter: exclude /repos/releng/zuul/zuul/nodepool-launcher [puppet] - https://gerrit.wikimedia.org/r/1164393 (owner: Elukey)
[09:19:33] (CR) Volans: WIP: netbox-snippets test cookbook to get started (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1163318 (owner: Cathal Mooney)
[09:24:29] (PS4) Muehlenhoff: sretest: report to both debmonitor servers [puppet] - https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[09:24:40] (PS5) Muehlenhoff: sretest: report to both debmonitor servers [puppet] - https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[09:25:24] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[09:25:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:26:03] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[09:27:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:30:21] (CR) Muehlenhoff: [C:+2] sretest: report to both debmonitor servers [puppet] - https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[09:34:31] (PS1) Vgutierrez: exim: Start using ECDSA certificate on mx-in [puppet] - https://gerrit.wikimedia.org/r/1164397 (https://phabricator.wikimedia.org/T398019)
[09:35:43] !log taavi@cumin1003 START - Cookbook sre.dns.netbox
[09:35:50] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164397 (https://phabricator.wikimedia.org/T398019) (owner: Vgutierrez)
[09:36:55] (PS2) AikoChou: ml-services: update edit-check image in experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1164271 (https://phabricator.wikimedia.org/T397013)
[09:38:03] (PS1) Elukey: admin_ng: create clusterrole and binding for debmonitor [deployment-charts] - https://gerrit.wikimedia.org/r/1164398 (https://phabricator.wikimedia.org/T397696)
[09:39:53] (PS58) Cathal Mooney: sre.dns.netbox-records cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985)
[09:41:12] taavi@cumin1003 netbox (PID 3436949) is awaiting input
[09:41:27] (CR) Elukey: [C:+1] Move docker-report from build2001 to build2002 [puppet] - https://gerrit.wikimedia.org/r/1164219 (https://phabricator.wikimedia.org/T379343) (owner: Muehlenhoff)
[09:42:11] (CR) Cathal Mooney: sre.dns.netbox-records cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: Cathal Mooney)
[09:42:29] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164397 (https://phabricator.wikimedia.org/T398019) (owner: Vgutierrez)
[09:42:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:43:08] (CR) AikoChou: ml-services: update edit-check image in experimental ns (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1164271
(https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [09:43:11] (03PS1) 10Majavah: P:toolforge::proxy: api: Use ec-prime256v1 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1164399 (https://phabricator.wikimedia.org/T375569) [09:43:19] (03CR) 10JMeybohm: [C:03+1] admin_ng: create clusterrole and binding for debmonitor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164398 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:44:39] !log restart swift-object-replicator on ms-be2077 [09:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:32] (03CR) 10CI reject: [V:04-1] sre.dns.netbox-records cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [09:47:45] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [09:48:02] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet [09:48:07] (03CR) 10Bartosz Wójtowicz: [C:03+1] "Thank you for this work! 
Approving from my side :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164271 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [09:48:41] (03CR) 10WMDE-Fisch: [C:03+1] [analytics][refine]: Stop refining TwoColConflict* legacy EventLogging streams [puppet] - 10https://gerrit.wikimedia.org/r/1164356 (https://phabricator.wikimedia.org/T397611) (owner: 10Phuedx) [09:49:23] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [09:49:37] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet [09:50:00] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397983#10953391 (10phaultfinder) [09:53:34] (03PS59) 10Cathal Mooney: sre.dns.netbox-records cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [09:55:57] !log taavi@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:58:15] !log stevemunene@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:58:16] !log stevemunene@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-conf1001.eqiad.wmnet [10:00:11] (03PS9) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) [10:00:19] (03CR) 10CI reject: [V:04-1] sre.dns.netbox-records cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [10:00:37] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-conf[1002-1003].eqiad.wmnet with reason: Awaiting decommissioning [10:00:51] (03CR) 10Jcrespo: [C:03+1] "Las backup running, this can go now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [10:01:20] (03CR) 10Jcrespo: [C:03+2] dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [10:01:35] (03CR) 10Elukey: [C:03+2] admin_ng: create clusterrole and binding for debmonitor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164398 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:01:45] (03CR) 10FNegri: [C:03+1] P:toolforge::proxy: api: Use ec-prime256v1 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1164399 (https://phabricator.wikimedia.org/T375569) (owner: 10Majavah) [10:01:52] (03CR) 10Jcrespo: [C:03+2] "*Last" [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [10:02:30] (03PS60) 10Cathal Mooney: sre.dns.netbox-records cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [10:03:00] (03CR) 10Gkyziridis: [C:03+1] "Thnx for working on that Aiko!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1164271 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [10:03:17] (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: api: Use ec-prime256v1 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1164399 (https://phabricator.wikimedia.org/T375569) (owner: 10Majavah) [10:05:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:05:21] (03PS4) 10Fabfur: cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) [10:05:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:05:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:09:42] (03Abandoned) 10Vgutierrez: exim: Start using ECDSA certificate on mx-in [puppet] - 10https://gerrit.wikimedia.org/r/1164397 (https://phabricator.wikimedia.org/T398019) (owner: 10Vgutierrez) [10:10:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:10:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:13:02] (03PS1) 10Muehlenhoff: On build2002 only submit to debmonitor-next [puppet] - 10https://gerrit.wikimedia.org/r/1164410 (https://phabricator.wikimedia.org/T397696) [10:15:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:19:17] (03CR) 10Elukey: [C:03+1] On build2002 only submit to debmonitor-next [puppet] - 10https://gerrit.wikimedia.org/r/1164410 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [10:22:21] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10953459 (10MatthewVernon) Using `test-cookbook` and the currently-in-review check-dbs cookbook on all the thumbnail container dbs, we find th... 
[10:22:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [10:24:05] (03CR) 10Muehlenhoff: [C:03+2] On build2002 only submit to debmonitor-next [puppet] - 10https://gerrit.wikimedia.org/r/1164410 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [10:24:38] (03CR) 10Stevemunene: [C:03+2] hdfs: Assign the right role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1163778 (https://phabricator.wikimedia.org/T397615) (owner: 10Stevemunene) [10:25:46] (03PS1) 10Vgutierrez: acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) [10:25:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:26:03] (03PS1) 10Fabfur: data: removal of unneeded volunteers from analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1164419 (https://phabricator.wikimedia.org/T397850) [10:26:25] (03CR) 10CI reject: [V:04-1] acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [10:27:02] (03PS1) 10Muehlenhoff: Bump access by two weeks [puppet] - 10https://gerrit.wikimedia.org/r/1164420 [10:27:29] (03CR) 10AOkoth: [C:03+2] os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:27:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:27:56] (03PS1) 10Elukey: profile::docker::reporter: add K8s credentials for demonitor [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) [10:29:38] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6097/co" [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:30:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:31:02] (03PS2) 10Elukey: profile::docker::reporter: add K8s credentials for demonitor [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) [10:32:43] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:34:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:34:25] (03PS3) 10AOkoth: doc: decom doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) [10:34:51] (03PS3) 10Elukey: profile::docker::reporter: add K8s credentials for demonitor 
[puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) [10:34:55] ayounsi@cumin2002 reimage (PID 2540181) is awaiting input [10:35:42] (03PS1) 10Hnowlan: changeprop: fix broken metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) [10:36:22] (03PS2) 10Vgutierrez: acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) [10:36:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:36:57] (03CR) 10AikoChou: [C:03+2] ml-services: update edit-check image in experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164271 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [10:37:08] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [10:37:11] (03PS61) 10Cathal Mooney: sre.dns.netbox-records cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [10:37:53] (03CR) 10Clément Goubert: [C:03+1] changeprop: fix broken metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan) [10:38:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:38:39] (03Merged) 10jenkins-bot: ml-services: update edit-check image in experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164271 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [10:39:08] (03PS4) 10Elukey: profile::docker::reporter: add K8s credentials for demonitor [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) [10:39:10] 
FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:39:17] (03PS9) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [10:39:32] (03CR) 10Elukey: profile::docker::reporter: add K8s credentials for demonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:40:29] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:40:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:41:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:42:44] (03CR) 10Volans: profile::docker::reporter: add K8s credentials for demonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:43:46] (03PS1) 10Vgutierrez: hiera: stop issuing mx cert 
[puppet] - 10https://gerrit.wikimedia.org/r/1164423 [10:44:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:46:08] (03CR) 10Fabfur: [C:03+1] hiera: stop issuing mx cert [puppet] - 10https://gerrit.wikimedia.org/r/1164423 (owner: 10Vgutierrez) [10:46:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:47:26] (03CR) 10Vgutierrez: [C:03+2] hiera: stop issuing mx cert [puppet] - 10https://gerrit.wikimedia.org/r/1164423 (owner: 10Vgutierrez) [10:47:47] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, but also adding Jesse to sanity-check" [puppet] - 10https://gerrit.wikimedia.org/r/1164423 (owner: 10Vgutierrez) [10:48:03] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:48:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:49:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:49:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:49:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:50:45] (03PS1) 10Ayounsi: Redfish get_primary_mac() - extra error handling [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164424 [10:52:01] 10ops-codfw, 06DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T398024 (10phaultfinder) 03NEW [10:53:32] (03PS2) 10Ayounsi: Redfish get_primary_mac() - extra error handling [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164424 [10:53:54] (03PS5) 10Elukey: profile::docker::reporter: add K8s credentials for demonitor [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) [10:54:38] (03PS3) 10Ayounsi: Redfish get_primary_mac() - extra error handling [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164424 [10:54:38] (03PS5) 10Kosta Harlan: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [10:55:06] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6100/co" [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:55:27] (03CR) 10Cathal Mooney: sre.dns.netbox-records cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [10:55:28] (03CR) 10CI reject: [V:04-1] temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [10:58:27] (03CR) 10Elukey: [V:03+1] profile::docker::reporter: add K8s credentials for demonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:59:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250627T0700) [11:00:05] jelto, arnoldokoth, and mutante: Your horoscope predicts another GitLab version upgrades deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250627T1100). [11:00:09] (03PS1) 10Btullis: Fix typo in the thanos_test catalog config for an-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/1164425 (https://phabricator.wikimedia.org/T347430) [11:02:20] (03CR) 10Stevemunene: [C:03+1] Fix typo in the thanos_test catalog config for an-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/1164425 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [11:03:14] !log elukey@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[11:04:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:04:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:04:59] !log elukey@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:06:15] (03PS2) 10Hnowlan: changeprop: fix broken metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) [11:06:22] !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:07:36] !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:07:47] (03CR) 10Volans: sre.dns.netbox-records cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [11:09:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:09:36] !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [11:10:25] (03Abandoned) 10Ayounsi: Redfish get_primary_mac() - extra error handling [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164424 (owner: 10Ayounsi) [11:10:37] aokoth@cumin1002 aokoth: The backup on gitlab2002 is complete, ready to proceed with upgrade. 
[11:11:32] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [11:12:16] !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [11:12:32] (03CR) 10Elukey: [V:03+1 C:03+2] profile::docker::reporter: add K8s credentials for demonitor [puppet] - 10https://gerrit.wikimedia.org/r/1164421 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [11:13:30] !log ayounsi@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [11:15:11] (03CR) 10Alexandros Kosiaris: [C:03+1] changeprop: fix broken metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan) [11:18:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:18:20] (03CR) 10Alexandros Kosiaris: [C:04-1] "LGTM, minor comment inline." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan) [11:22:15] (03CR) 10Muehlenhoff: [C:03+1] data: removal of unneeded volunteers from analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1164419 (https://phabricator.wikimedia.org/T397850) (owner: 10Fabfur) [11:22:21] (03PS3) 10Hnowlan: changeprop: fix broken metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) [11:22:45] (03CR) 10Hnowlan: "This incorporates a code change so I'll hold til Monday" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan) [11:22:53] !log aokoth@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Update [11:23:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:23:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:25] (03CR) 10Muehlenhoff: [C:03+2] Bump access by two weeks [puppet] - 10https://gerrit.wikimedia.org/r/1164420 (owner: 10Muehlenhoff) [11:24:34] (03CR) 10Cathal Mooney: sre.dns.netbox-records cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [11:27:14] (03CR) 10Ayounsi: [C:03+1] "neat, I've spent quite some time trying to get something like that to work :)" [puppet] - 
10https://gerrit.wikimedia.org/r/1164315 (owner: 10JHathaway) [11:27:47] !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [11:28:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:33:06] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [11:38:31] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:07] (03CR) 10Btullis: [C:03+2] Fix typo in the thanos_test catalog config for an-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/1164425 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [11:42:40] (03CR) 10Ayounsi: [C:03+1] "wow, nice!!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [11:45:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:48:47] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [11:50:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:52:18] (03PS3) 10Kamila Součková: Add fake hcaptcha proxy secrets. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1155221 (https://phabricator.wikimedia.org/T381265) [11:54:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:55:17] (03PS1) 10Kamila Součková: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T381265) [11:56:30] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [11:59:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:00:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1164264 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [12:04:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:06:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:06:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: 
cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:11:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:16:47] (03CR) 10Fabfur: [C:03+2] data: removal of unneeded volunteers from analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1164419 (https://phabricator.wikimedia.org/T397850) (owner: 10Fabfur) [12:19:49] (03Abandoned) 10Sbisson: CX instrumentation: Fix translation providers in desktop editor events [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163845 (https://phabricator.wikimedia.org/T395493) (owner: 10Sbisson) [12:24:12] (03PS1) 10Muehlenhoff: Remove Kerberos for two users [puppet] - 10https://gerrit.wikimedia.org/r/1164444 (https://phabricator.wikimedia.org/T397850) [12:31:19] RECOVERY - Host wikikube-worker1243 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [12:34:08] (03PS2) 10Ayounsi: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [12:34:09] (03PS1) 10Ayounsi: reimage: merge UUID and MAC [cookbooks] - 10https://gerrit.wikimedia.org/r/1164446 [12:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:40:57] 
(03CR) 10Muehlenhoff: [C:03+2] Remove Kerberos for two users [puppet] - 10https://gerrit.wikimedia.org/r/1164444 (https://phabricator.wikimedia.org/T397850) (owner: 10Muehlenhoff) [12:41:30] (03CR) 10CI reject: [V:04-1] reimage: merge UUID and MAC [cookbooks] - 10https://gerrit.wikimedia.org/r/1164446 (owner: 10Ayounsi) [12:41:44] (03PS10) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [12:43:19] (03CR) 10CI reject: [V:04-1] reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [12:43:42] (03CR) 10Volans: "Minor suggestions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [12:48:05] (03PS1) 10Muehlenhoff: Remove Kerberos for two users [puppet] - 10https://gerrit.wikimedia.org/r/1164447 [12:49:15] (03PS1) 10Vgutierrez: haproxy,varnish: Introduce a host independent healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) [12:51:42] (03CR) 10Fabfur: cache,haproxy: refactor haproxy captures to fix x-analytics logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [12:54:14] (03CR) 10Muehlenhoff: [C:03+2] Remove Kerberos for two users [puppet] - 10https://gerrit.wikimedia.org/r/1164447 (owner: 10Muehlenhoff) [12:54:35] RECOVERY - Host wikikube-worker1069 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [12:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:56:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) 
{#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:57:18] (03CR) 10Vgutierrez: cache,haproxy: refactor haproxy captures to fix x-analytics logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [12:58:26] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [12:58:49] (03CR) 10Vgutierrez: "varnish tests are happy for both text & upload" [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [12:59:33] (03PS1) 10Volans: debmonitor: fix alerts for -next [puppet] - 10https://gerrit.wikimedia.org/r/1164450 (https://phabricator.wikimedia.org/T397696) [13:01:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1164450 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:02:14] (03PS2) 10Ayounsi: reimage: merge UUID and MAC [cookbooks] - 10https://gerrit.wikimedia.org/r/1164446 [13:04:56] (03CR) 10Volans: [C:03+2] debmonitor: fix alerts for -next [puppet] - 10https://gerrit.wikimedia.org/r/1164450 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:05:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:08:29] (03CR) 10CI reject: [V:04-1] reimage: merge UUID and MAC [cookbooks] - 10https://gerrit.wikimedia.org/r/1164446 (owner: 10Ayounsi) [13:09:37] (03CR) 10Cathal Mooney: sre.dns.netbox-records cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 
(https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [13:10:14] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [13:10:45] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet [13:11:24] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [13:11:40] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet [13:15:35] (03CR) 10Cathal Mooney: sre.dns.netbox-records cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [13:15:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:19:29] (03CR) 10Bking: [C:03+2] elastic/cirrussearch: re-enable monitoring for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1151256 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:19:33] FIRING: [2x] ProbeDown: Service debmonitor2003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:51] (03CR) 10Bking: [C:03+2] "self merging as this should've been re-enabled already" [puppet] - 10https://gerrit.wikimedia.org/r/1151256 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:20:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:27:07] (03PS5) 10Fabfur: cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) [13:27:28] (03CR) 10Fabfur: cache,haproxy: refactor haproxy captures to fix x-analytics logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:27:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:28] (03CR) 10JHathaway: [C:03+2] dhcpd: add pxe-client-id [puppet] - 10https://gerrit.wikimedia.org/r/1164315 (owner: 10JHathaway) [13:30:24] (03PS11) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [13:32:15] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [13:37:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#10954105 (10Jclark-ctr) Attempted to perform BIOS and iDRAC updates, but both failed. Dell Support requested a flea power drain. After performing... [13:37:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10954106 (10Jclark-ctr) Attempted to perform BIOS and iDRAC updates, but both failed. Dell Support requested a flea power drain. After performing t... 
[13:38:03] (03PS12) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [13:38:19] (03CR) 10Vgutierrez: [C:04-1] cache,haproxy: refactor haproxy captures to fix x-analytics logging (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:38:45] (03Abandoned) 10Btullis: mediawiki: Use the servergroup to configure the dumps feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127916 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [13:38:53] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10954108 (10MoritzMuehlenhoff) [13:40:03] (03CR) 10CI reject: [V:04-1] WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [13:42:38] (03PS13) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [13:44:04] (03PS6) 10Fabfur: cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) [13:44:21] (03CR) 10Fabfur: cache,haproxy: refactor haproxy captures to fix x-analytics logging (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:44:33] (03PS3) 10Ssingh: prometheus: add dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) [13:44:58] (03CR) 10Ssingh: prometheus: add dnsbox_service_state_exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:45:29] (03CR) 10CI reject: [V:04-1] prometheus: add dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1164296 
(https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:46:25] (03PS4) 10Ssingh: prometheus: add dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) [13:47:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:48:27] (03PS2) 10Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) [13:51:14] (03PS1) 10Hnowlan: wmnet: add discovery records for thumbor [dns] - 10https://gerrit.wikimedia.org/r/1164457 (https://phabricator.wikimedia.org/T397618) [13:53:15] (03PS1) 10Hnowlan: service: add discovery active/active config [puppet] - 10https://gerrit.wikimedia.org/r/1164458 (https://phabricator.wikimedia.org/T397618) [13:54:55] (03CR) 10JHathaway: [C:03+1] "looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164446 (owner: 10Ayounsi) [14:01:55] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [14:02:18] 10SRE-tools, 06cloud-services-team, 06Infrastructure-Foundations: sre.hosts.decommission often leaves dangling things in netbox - https://phabricator.wikimedia.org/T398052 (10Andrew) 03NEW [14:02:39] 10SRE-tools, 06cloud-services-team, 06Infrastructure-Foundations: sre.hosts.decommission often leaves dangling things in netbox - https://phabricator.wikimedia.org/T398052#10954206 (10Andrew) [14:03:00] (03PS7) 10Fabfur: cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) [14:03:02] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002 - https://phabricator.wikimedia.org/T397868#10954210 (10Andrew) For whoever applies these pending changes... 
there are now also (non-urgent) pending changes for cloud... [14:03:26] (03CR) 10Ssingh: haproxy,varnish: Introduce a host independent healthcheck (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [14:04:36] (03CR) 10Volans: "Left some possible simplification comments, nothing is a blocker though." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [14:05:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:07:20] (03PS7) 10Ayounsi: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [14:08:21] (03CR) 10JHathaway: dhcp: add a UUID based DHCP config (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [14:10:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:12:41] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet [14:12:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:13:04] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [14:13:11] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10954234 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm executed with errors: - sretest2005 (... [14:13:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:14:33] (03CR) 10Xcollazo: [C:03+1] "Balthazar, all Spark jobs will pick this up via `SPARK_CONF_DIR`, which is set via Airflow config, thus independent of k8s." [puppet] - 10https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) (owner: 10Btullis) [14:14:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:15:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:16:04] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:16:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:17:13] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:18:23] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [14:23:31] jhancock@cumin1003 provision (PID 3463469) is awaiting input [14:24:19] jhancock@cumin2002 provision (PID 2585182) is awaiting input [14:28:18] (03PS1) 10Jhancock.wm: Adding and Updating sretest hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1164459 (https://phabricator.wikimedia.org/T396365) [14:28:35] (03PS6) 10Tchanders: temp accounts: Enable temp account creation on further wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) [14:28:38] jhancock@cumin2002 provision (PID 2585182) is awaiting input [14:28:39] jhancock@cumin1003 provision (PID 3463469) is awaiting input [14:28:51] (03CR) 10Tchanders: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [14:29:22] (03CR) 10Jhancock.wm: "@rob@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1164459 (https://phabricator.wikimedia.org/T396365) (owner: 10Jhancock.wm) [14:29:30] (03CR) 10CI reject: [V:04-1] temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [14:29:33] FIRING: [6x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:54] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [14:29:57] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:30:01] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [14:30:21] (03CR) 10RobH: [C:03+2] Adding and Updating sretest hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1164459 (https://phabricator.wikimedia.org/T396365) (owner: 10Jhancock.wm) [14:30:38] (03PS2) 10Jhancock.wm: Adding and Updating sretest hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1164459 (https://phabricator.wikimedia.org/T396365) [14:30:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for 
host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:30:47] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:30:51] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [14:30:52] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:31:03] (03CR) 10RobH: [C:03+2] Adding and Updating sretest hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1164459 (https://phabricator.wikimedia.org/T396365) (owner: 10Jhancock.wm) [14:33:18] (03PS8) 10Fabfur: cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) [14:33:30] (03Abandoned) 10Jhancock.wm: Adding and Updating sretest hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1164459 (https://phabricator.wikimedia.org/T396365) (owner: 10Jhancock.wm) [14:33:31] FIRING: [6x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:52] (03CR) 10Jcrespo: [C:03+2] Revert "bacula: Create a temporary backup job for long term Archival" [puppet] - 10https://gerrit.wikimedia.org/r/1164302 (owner: 10Jcrespo) [14:35:57] (03PS3) 10Jcrespo: Revert "bacula: Create a temporary backup job for long term Archival" [puppet] - 10https://gerrit.wikimedia.org/r/1164302 [14:36:59] (03PS1) 10Jhancock.wm: Updating and Adding sretest hosts to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1164460 
(https://phabricator.wikimedia.org/T396365) [14:38:19] (03CR) 10Jcrespo: [C:03+2] Revert "bacula: Create a temporary backup job for long term Archival" [puppet] - 10https://gerrit.wikimedia.org/r/1164302 (owner: 10Jcrespo) [14:40:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:41:45] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:42:18] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [14:42:28] (03CR) 10RobH: [C:03+2] Updating and Adding sretest hosts to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1164460 (https://phabricator.wikimedia.org/T396365) (owner: 10Jhancock.wm) [14:43:49] !log start full-cluster reindex operations for cirrussearch eqiad/codfw/cloudelastic clusters [14:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:36] ACKNOWLEDGEMENT - Backup freshness on backup1014 is CRITICAL: Stale: 1 (backup1013), No backups: 7 (dbprov1003, ...), Fresh: 142 jobs Jcrespo expected until monday backup run https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:46:41] (03PS8) 10Ayounsi: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [14:47:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:48:14] (03PS3) 10JHathaway: reimage: add support for using the 
host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 [14:50:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:51:45] (03PS9) 10Fabfur: cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) [14:52:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:53:27] (03CR) 10Ssingh: haproxy,varnish: Introduce a host independent healthcheck (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [14:54:35] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet [14:54:46] (03CR) 10CI reject: [V:04-1] reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [14:54:55] (03CR) 10Ssingh: "First pass: very nice work and thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [14:56:16] (03PS1) 10Volans: debmonitor: add simple auth-check endpoint [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164463 (https://phabricator.wikimedia.org/T397696) [14:57:16] !log configuration on cp7007 reverted and host repooled (T397917) [14:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:23] T397917: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917 [14:58:00] (03PS1) 10RobH: sretest updates [puppet] - 10https://gerrit.wikimedia.org/r/1164464 (https://phabricator.wikimedia.org/T396365) [14:58:04] (03PS2) 10Vgutierrez: haproxy,varnish: Introduce a host independent healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) [14:58:18] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [14:58:19] (03CR) 10Vgutierrez: haproxy,varnish: Introduce a host independent healthcheck (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [14:58:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164463 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [14:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:59:34] (03CR) 10Ssingh: [C:03+1] haproxy,varnish: Introduce a host independent healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) 
[15:00:15] (03PS14) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [15:00:46] (03CR) 10Volans: [C:03+2] debmonitor: add simple auth-check endpoint [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164463 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [15:01:40] (03PS1) 10Vgutierrez: service: Target upload.wm.o on upload-https healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1164466 (https://phabricator.wikimedia.org/T394484) [15:02:01] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164466 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:02:16] (03Abandoned) 10RobH: sretest updates [puppet] - 10https://gerrit.wikimedia.org/r/1164464 (https://phabricator.wikimedia.org/T396365) (owner: 10RobH) [15:03:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164466 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:03:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:44] (03PS1) 10RobH: fixing sretest [puppet] - 10https://gerrit.wikimedia.org/r/1164467 (https://phabricator.wikimedia.org/T396365) [15:05:05] (03CR) 10RobH: [C:03+2] fixing sretest [puppet] - 10https://gerrit.wikimedia.org/r/1164467 (https://phabricator.wikimedia.org/T396365) (owner: 10RobH) [15:05:09] (03CR) 10Ssingh: [C:03+1] service: Target upload.wm.o on upload-https healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1164466 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:05:20] (03CR) 10RobH: [C:03+2] "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1164467 (https://phabricator.wikimedia.org/T396365) (owner: 10RobH) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:17] (03CR) 10Eevans: [C:03+2] sessionstore2006: preseed d-i for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1164307 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [15:08:32] (03CR) 10Fabfur: [C:04-2] "pls merge it on monday" [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [15:10:35] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:10:44] (03Merged) 10jenkins-bot: debmonitor: add simple auth-check endpoint [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164463 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [15:11:58] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:16:33] (03PS1) 10Zabe: categorylinks: Set testwiki to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164472 (https://phabricator.wikimedia.org/T397912) [15:16:34] (03PS1) 10RobH: sretest preseed update [puppet] - 10https://gerrit.wikimedia.org/r/1164471 (https://phabricator.wikimedia.org/T396365) [15:16:35] (03PS9) 10Ayounsi: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [15:16:35] (03PS1) 10Ayounsi: Redfish: more tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:44] (03CR) 10Zabe: [C:04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164472 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [15:16:59] (03PS62) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [15:18:28] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:18:55] (03PS2) 10RobH: sretest preseed update [puppet] - 10https://gerrit.wikimedia.org/r/1164471 (https://phabricator.wikimedia.org/T396365) [15:19:27] (03CR) 10RobH: [C:03+2] sretest preseed update [puppet] - 10https://gerrit.wikimedia.org/r/1164471 (https://phabricator.wikimedia.org/T396365) (owner: 10RobH) [15:19:52] (03PS2) 10Ayounsi: Redfish: more tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 [15:21:47] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10954534 (10Jhancock.wm) no, sorry, still getting the same error [15:22:11] (03PS1) 10Bernard Wang: Prevent extra scrolling when dialog is open on ios [skins/MinervaNeue] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164474 (https://phabricator.wikimedia.org/T397539) [15:22:51] (03PS1) 10Bernard Wang: Add workaround for iOS to ensure the virtual keyboard is opened when the mobile TAHS overlay is opened [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164475 (https://phabricator.wikimedia.org/T397469) [15:23:15] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:23:28] (03CR) 10Brouberol: [C:03+1] "Perfect, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) (owner: 10Btullis) [15:24:15] (03CR) 10CI reject: [V:04-1] sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [15:27:58] (03CR) 10Kamila Součková: "The private config would look like https://phabricator.wikimedia.org/P78219#313963 ." 
[puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [15:29:13] (03PS15) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [15:31:04] !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox [15:31:14] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:31:20] (03CR) 10CI reject: [V:04-1] WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [15:34:08] (03PS16) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [15:34:28] jhancock@cumin1003 provision (PID 3472048) is awaiting input [15:35:15] !log dwisehaupt@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decommissioning of frack hosts. - dwisehaupt@cumin1002" [15:35:33] !log jhancock@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:36:16] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decommissioning of frack hosts. 
- dwisehaupt@cumin1002" [15:36:16] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:17] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:36:27] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:36:57] (03PS63) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [15:37:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:37:31] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:38:27] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002 - https://phabricator.wikimedia.org/T397868#10954589 (10Dwisehaupt) Sorry about that. I ran the authdns-update but totally forgot about the cookbook. I've updated my... 
[15:38:31] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:39:54] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:40:29] (03PS64) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [15:40:54] (03PS65) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [15:41:13] (03CR) 10Cathal Mooney: sre.dns.netbox-future cookbook (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [15:41:14] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:41:38] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:42:00] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:42:15] (03PS17) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [15:43:21] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: 
frpig2001 pay-lvs2001 pay-lvs2002 - https://phabricator.wikimedia.org/T397868#10954596 (10Stevemunene) Thanks @Dwisehaupt [15:44:48] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#10954598 (10Jhancock.wm) Traceback (most recent call last): File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 497, in _found_diffs_bios_attributes if not bios_attributes[key... [15:45:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:45:23] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:45:47] !log stevemunene@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-conf1002.eqiad.wmnet [15:46:13] (03PS3) 10Ssingh: acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [15:46:55] (03CR) 10Ssingh: "rebasing to remove the already removed mx: bits in acme_chief.yaml in I9b88619fcc82946873bcef0e254bfe351a22db45" [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [15:47:13] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6103/co" [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [15:50:10] 10SRE-swift-storage, 06Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#10954647 (10Vgutierrez) https://salsa.debian.org/cloud-team/aws-lc could be a 
good starting point for aws-lc packages [15:50:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:51:40] !log stevemunene@cumin1003 START - Cookbook sre.dns.netbox [15:52:08] (03PS10) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [15:53:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:55:00] !log stevemunene@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-conf1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1003" [15:55:35] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-conf1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1003" [15:55:35] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:55:36] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-conf1002.eqiad.wmnet [15:56:00] !log stevemunene@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-conf1003.eqiad.wmnet [15:56:38] (03CR) 10Ssingh: [V:03+1] "looks good, nice reduce, one nit:" [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [15:56:48] RESOLVED: PuppetZeroResources: Puppet has failed 
generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:57:24] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:57:35] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:58:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:58:27] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:58:41] (03CR) 10Ssingh: [V:03+1 C:03+1] acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [15:59:48] (03PS4) 10Vgutierrez: acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) [15:59:56] (03PS11) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [15:59:57] (03PS18) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [16:00:09] (03CR) 10Vgutierrez: "thx for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [16:00:10] (03CR) 10Ssingh: [C:03+1] acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [16:00:12] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:00:23] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:00:35] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [16:00:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:01:22] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:01:50] !log stevemunene@cumin1003 START - Cookbook sre.dns.netbox [16:02:05] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:02:37] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:02:46] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:02:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:04:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:04:29] (03PS12) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [16:04:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:05:06] !log stevemunene@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-conf1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1003" [16:05:26] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-conf1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1003" [16:05:26] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:05:28] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-conf1003.eqiad.wmnet [16:07:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - 
cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:09:05] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:09:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:09:48] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [16:09:55] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10954728 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2003.codfw.wmnet with OS bookworm [16:10:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:11:42] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:13:33] (03PS7) 10Tchanders: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) [16:13:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frban1001 - https://phabricator.wikimedia.org/T397869#10954736 (10VRiley-WMF) 05Open→03Resolved [16:14:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frban1001 - https://phabricator.wikimedia.org/T397869#10954742 (10VRiley-WMF) Unracked and decommed this server [16:14:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:15:37] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:41] (03PS19) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [16:18:53] (03CR) 10JHathaway: "thanks for the review volans, I think it is ready for a second look" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [16:19:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): decommission htmldumper1001.eqiad.wmnet - https://phabricator.wikimedia.org/T397434#10954772 (10VRiley-WMF) 05Open→03Resolved [16:20:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): decommission htmldumper1001.eqiad.wmnet - https://phabricator.wikimedia.org/T397434#10954790 (10VRiley-WMF) This has been decommed [16:20:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:22:55] (03PS20) 10Volans: WIP: add support 
for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [16:23:07] (03PS66) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [16:24:30] (03PS21) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [16:27:40] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [16:27:52] !log ayounsi@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS bookworm [16:31:39] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 13Patch-Needs-Improvement: role::puppetmaster::standalone clones Git repositories as gitpuppet, git-sync-upstream overwrites them as root - https://phabricator.wikimedia.org/T152059#10954854 (10taavi) 05Open→03Resolved I believe this was fixed with the... [16:31:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:32:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [16:32:07] (03PS1) 10Volans: kubernetes: raise 400 on missing image [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164480 (https://phabricator.wikimedia.org/T397696) [16:32:56] (03CR) 10Elukey: [C:03+1] kubernetes: raise 400 on missing image [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164480 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:33:31] !log jhancock@cumin1003 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:34:28] (03CR) 10Volans: [C:03+2] kubernetes: raise 400 on missing image [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164480 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:35:01] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#10954860 (10Jhancock.wm) [16:35:09] (03Abandoned) 10Majavah: puppetmaster: Clone repositories in Labs as root [puppet] - 10https://gerrit.wikimedia.org/r/324727 (https://phabricator.wikimedia.org/T152059) (owner: 10Tim Landscheidt) [16:35:31] (03Merged) 10jenkins-bot: kubernetes: raise 400 on missing image [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164480 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:36:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:37:53] (03PS1) 10Volans: Upstream release v0.6.2 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164482 [16:38:08] (03CR) 10Volans: [C:03+2] Upstream release v0.6.2 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164482 (owner: 10Volans) [16:38:22] (03PS1) 10Majavah: P:exim4::smarthost: Migrate to ec-prime256v1 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1164483 [16:38:57] (03Merged) 10jenkins-bot: Upstream release v0.6.2 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164482 (owner: 10Volans) [16:40:10] !log ayounsi@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2009.codfw.wmnet with OS bookworm [16:41:25] (03PS22) 10Elukey: WIP: add support 
for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [16:42:37] !log uploaded debmonitor-server,python3-debmonitor_0.6.2 to apt.wikimedia.org bookworm-wikimedia [16:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:40] (03PS23) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [16:45:48] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [16:45:53] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10954914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2004.codfw.wmnet with OS bookworm [16:46:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:48:41] (03PS1) 10Volans: debmonitor: use the new endpoint for the check [puppet] - 10https://gerrit.wikimedia.org/r/1164485 (https://phabricator.wikimedia.org/T397696) [16:49:26] (03CR) 10Volans: [C:04-1] "Do be merged only after production has been updated to the latest Debmonitor-server version or this will start failing on the production i" [puppet] - 10https://gerrit.wikimedia.org/r/1164485 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:51:25] (03CR) 10Volans: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164485 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:51:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2003.codfw.wmnet with OS bookworm [16:51:31] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - 
https://phabricator.wikimedia.org/T393042#10954924 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2003.codfw.wmnet with OS bookworm completed: - sretest2003 (**PASS**)... [16:51:57] (03PS4) 10Ayounsi: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [16:53:07] RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [16:53:28] (03PS24) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [16:59:43] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10954940 (10cmooney) Folks you need to delete the interfaces on the box to get around this. I've done that now, Jenn hopefully will work... [17:01:59] (03CR) 10CI reject: [V:04-1] reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [17:13:49] (03CR) 10Dreamy Jazz: [C:03+1] "LGTM from a Trust and Safety Product Team point of view (team working on temporary accounts). This should be fine to merge in a backport w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) (owner: 10LD) [17:13:58] (03PS4) 10Kamila Součková: Add fake hcaptcha proxy secrets. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1155221 (https://phabricator.wikimedia.org/T397841) [17:14:04] (03CR) 10Tchanders: [C:03+1] frwiki: allow bureaucrats to assign and remove temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) (owner: 10LD) [17:14:16] (03PS2) 10Kamila Součková: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) [17:15:18] (03CR) 10JHathaway: reimage: add support for using the host UUID for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [17:16:05] (03PS67) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [17:16:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) (owner: 10LD) [17:17:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) (owner: 10LD) [17:17:51] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10954980 (10Jhancock.wm) 05Open→03Resolved [17:20:26] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10954987 (10Jhancock.wm) @Marostegui this test server is similar to your es servers. It has 1CPU. Would you like to do some testing with this one? I've done the testing i need to on... 
[17:22:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:23:39] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10954990 (Jhancock.wm) @bking this server is similar to the elastic servers but with 1 CPU. I've finished the testing I need to do. Would you like to take it for testing? [17:27:01] (PS68) Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [17:28:57] ops-codfw, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10954997 (Jhancock.wm) Open→Resolved [17:36:08] jhancock@cumin1003 provision (PID 3487176) is awaiting input [17:36:45] FIRING: CirrusStreamingUpdaterFlinkNoRegisteredTask: ...
[17:36:45] cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search-backfill - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTask [17:41:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:41:16] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:45:12] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:46:17] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:47:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:48:57] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:44] RESOLVED: CirrusStreamingUpdaterFlinkNoRegisteredTask: ... 
[17:51:45] cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search-backfill - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTask [17:52:30] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10955038 (10Jhancock.wm) got it to run. hit a few new errors. nic is different i think and the cookbook isn't seeing it. this is what it is: NIC Slot 5: Broadcom BCM574... [17:52:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:52:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:00:42] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10955048 (10Andrew) >>! In T396396#10954940, @cmooney wrote: > Folks you need to delete the interfaces on the box to get around this.... 
[18:02:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:06:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:11:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:14:02] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[18:19:19] (03PS1) 10SD0001: Re-enable wgSpecialGadgetUsageActiveUsers for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164490 (https://phabricator.wikimedia.org/T397454) [18:26:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164490 (https://phabricator.wikimedia.org/T397454) (owner: 10SD0001) [18:29:30] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075 (10Htriedman) 03NEW [18:33:31] FIRING: [2x] ProbeDown: Service debmonitor2003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:35:27] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10955111 (10FNavas-foundation) yes please, confirming @Htriedman need. 
@HShaikh fyi [18:37:11] (03PS1) 10Andrew Bogott: Openstack designate: use 'designate' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164491 (https://phabricator.wikimedia.org/T273150) [18:37:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:37:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:42:10] PROBLEM - Disk space on prometheus1005 is CRITICAL: DISK CRITICAL - free space: / 2265MiB (3% inode=97%): /tmp 2265MiB (3% inode=97%): /var/tmp 2265MiB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1005&var-datasource=eqiad+prometheus/ops [18:42:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:47:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:51:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [19:02:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:26:41] !log dancy@deploy1003 Installing scap version "4.184.1" for 2 host(s) [19:27:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:28:32] !log dancy@deploy1003 Installation of scap version "4.184.1" completed for 2 hosts [19:33:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:34:40] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: decommission cloudcephosd200[12]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968#10955226 (10Andrew) [19:37:15] (03PS3) 10Andrea Denisse: centrallog: Log with standard and custom template [puppet] - 10https://gerrit.wikimedia.org/r/1163901 [19:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:50:48] (03CR) 10AOkoth: doc: decom doc2002 (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [19:55:19] (03PS1) 10Andrew Bogott: Add dummy ldap passwords for placement service user [labs/private] - 10https://gerrit.wikimedia.org/r/1164499 (https://phabricator.wikimedia.org/T273150) [20:00:53] (03PS2) 10Andrew Bogott: Openstack designate: use 'designate' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164491 (https://phabricator.wikimedia.org/T273150) [20:00:53] (03PS1) 10Andrew Bogott: Openstack placement: use 'placement' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164500 (https://phabricator.wikimedia.org/T273150) [20:01:14] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Add dummy ldap passwords for placement service user [labs/private] - 10https://gerrit.wikimedia.org/r/1164499 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [20:01:46] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164500 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [20:05:10] (03CR) 10Andrew Bogott: [C:03+2] Openstack placement: use 'placement' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164500 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [20:08:22] PROBLEM - MariaDB Replica Lag: m2 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 613.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:08:52] PROBLEM - MariaDB Replica Lag: m2 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:11:45] few hosts had a sudden spike https://grafana.wikimedia.org/goto/0vMAH9ENg?orgId=1 [20:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - 
https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:15:22] they are recovering quickly [20:16:52] RECOVERY - MariaDB Replica Lag: m2 on db1217 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:21:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:26:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:27:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:28:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:35:22] RECOVERY - MariaDB Replica Lag: m2 on db2160 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:36:13] (03PS13) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [20:40:18] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:40:54] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:41:12] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 555 bytes in 3.942 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:41:44] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 563 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:45:54] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:46:18] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:47:44] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.423 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:48:10] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 1.016 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:51:14] (03PS1) 10Andrew Bogott: Add dummy ldap passwords for trove service user [labs/private] - 10https://gerrit.wikimedia.org/r/1164503 (https://phabricator.wikimedia.org/T273150) [20:51:54] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:52:20] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:53:41] (03PS1) 10Andrew Bogott: Openstack trove: use 'trove' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164504 (https://phabricator.wikimedia.org/T273150) [20:54:12] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 555 bytes in 3.939 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:54:50] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 565 bytes in 5.458 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:59:32] (03Abandoned) 10Andrew Bogott: Add dummy ldap passwords for trove service user [labs/private] - 10https://gerrit.wikimedia.org/r/1164503 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [20:59:59] (03PS1) 10Novem Linguae: initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164506 (https://phabricator.wikimedia.org/T398080) [21:02:21] (03PS2) 10Andrew Bogott: Openstack trove: use 'trove' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164504 (https://phabricator.wikimedia.org/T273150) [21:02:21] (03PS3) 10Andrew Bogott: Openstack designate: use 'designate' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164491 (https://phabricator.wikimedia.org/T273150) [21:03:45] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164504 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [21:07:20] (03PS1) 10Novem Linguae: refactor 
unnecessary wmgSecurePollUseNamespace variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164507 [21:07:50] (03PS1) 10Andrew Bogott: Fix misnamed fake password [labs/private] - 10https://gerrit.wikimedia.org/r/1164508 [21:08:12] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Fix misnamed fake password [labs/private] - 10https://gerrit.wikimedia.org/r/1164508 (owner: 10Andrew Bogott) [21:08:23] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Openstack trove: use 'trove' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164504 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [21:27:38] (03CR) 10SD0001: [C:03+1] initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164506 (https://phabricator.wikimedia.org/T398080) (owner: 10Novem Linguae) [21:30:58] (03CR) 10JHathaway: reimage: add support for using the host UUID for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [21:49:18] (03CR) 10Zabe: [C:03+1] refactor unnecessary wmgSecurePollUseNamespace variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164507 (owner: 10Novem Linguae) [22:00:45] !log updated security patch for T355073 (scap update-patch) [22:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:31] FIRING: [2x] ProbeDown: Service debmonitor2003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:43:00] !log Start `GrowthExperiments:revalidateLinkRecommendations` for enwiki (T386867) [22:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:06] T386867: Add a Link: add "do not link" rule for country names (Q6256) on English Wikipedia - 
https://phabricator.wikimedia.org/T386867 [23:21:09] !log truncate /var/log/syslog on prometheus1005 T398091 [23:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:15] T398091: Prometheus1005 out of disk on / - https://phabricator.wikimedia.org/T398091 [23:22:10] RECOVERY - Disk space on prometheus1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1005&var-datasource=eqiad+prometheus/ops [23:38:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1164518 [23:38:57] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1164518 (owner: 10TrainBranchBot) [23:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:53:25] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1164518 (owner: 10TrainBranchBot)