[00:01:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:03:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:04:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T343718)', diff saved to https://phabricator.wikimedia.org/P51786 and previous config saved to /var/cache/conftool/dbconfig/20230829-000406-ladsgroup.json [00:04:14] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [00:04:51] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:08:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.011 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:11:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P51787 and previous config saved to /var/cache/conftool/dbconfig/20230829-001154-ladsgroup.json [00:17:35] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [00:19:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P51788 and previous config saved to /var/cache/conftool/dbconfig/20230829-001912-ladsgroup.json [00:27:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P51789 and previous 
config saved to /var/cache/conftool/dbconfig/20230829-002700-ladsgroup.json [00:30:36] (Nonwrite HTTP requests with primary DB writes alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert [00:34:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P51790 and previous config saved to /var/cache/conftool/dbconfig/20230829-003418-ladsgroup.json [00:37:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:17] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/952345 [00:38:23] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/952345 (owner: TrainBranchBot) [00:42:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T343718)', diff saved to https://phabricator.wikimedia.org/P51791 and previous config saved to /var/cache/conftool/dbconfig/20230829-004207-ladsgroup.json [00:42:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [00:42:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [00:42:13] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [00:42:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T343718)', diff saved to https://phabricator.wikimedia.org/P51792 and previous config saved to /var/cache/conftool/dbconfig/20230829-004217-ladsgroup.json [00:42:51] RECOVERY - Check systemd state on logstash1026 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:59] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [00:49:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T343718)', diff saved to https://phabricator.wikimedia.org/P51793 and previous config saved to /var/cache/conftool/dbconfig/20230829-004925-ladsgroup.json [00:49:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [00:49:31] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [00:49:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [00:50:36] (Nonwrite HTTP requests with primary DB writes alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert [00:52:29] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.07 ms [00:54:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/952345 (owner: TrainBranchBot) [00:59:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51794 and previous config saved to /var/cache/conftool/dbconfig/20230829-005933-ladsgroup.json [00:59:40] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [01:03:50] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T345122 (phaultfinder) [01:14:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P51795 and previous config saved to 
/var/cache/conftool/dbconfig/20230829-011440-ladsgroup.json [01:29:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P51796 and previous config saved to /var/cache/conftool/dbconfig/20230829-012946-ladsgroup.json [01:41:13] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:44:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51797 and previous config saved to /var/cache/conftool/dbconfig/20230829-014452-ladsgroup.json [01:44:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [01:44:58] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [01:45:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [01:46:55] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T0200) [02:01:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [02:01:22] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [02:01:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:07:02] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.24 [core] (wmf/1.41.0-wmf.24) - https://gerrit.wikimedia.org/r/952966 (https://phabricator.wikimedia.org/T343726) [02:07:08] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.24 [core] (wmf/1.41.0-wmf.24) - https://gerrit.wikimedia.org/r/952966 (https://phabricator.wikimedia.org/T343726) (owner: TrainBranchBot) [02:08:55] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - 
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:21:56] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.24 [core] (wmf/1.41.0-wmf.24) - https://gerrit.wikimedia.org/r/952966 (https://phabricator.wikimedia.org/T343726) (owner: TrainBranchBot) [02:28:55] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:36:43] (CR) Samwilson: [C: +1] Allow loading Edit-in-Sequence as a beta feature on Wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/952928 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta) [02:42:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:47:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:50:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage [02:53:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage [02:59:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T343718)', diff saved to https://phabricator.wikimedia.org/P51798 and previous config saved to 
/var/cache/conftool/dbconfig/20230829-025950-ladsgroup.json [02:59:56] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T0300) [03:01:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:08:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [03:09:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [03:09:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be2003.codfw.wmnet with OS bullseye [03:09:23] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye completed: -... [03:10:21] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T345122 (Jhancock.wm) Open→Resolved a:Jhancock.wm known issue with no impact. ignore. 
[03:11:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:11:47] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (Jhancock.wm) Open→Resolved [03:14:13] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (Jhancock.wm) @MatthewVernon all yours. last name is installed. [03:14:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P51799 and previous config saved to /var/cache/conftool/dbconfig/20230829-031456-ladsgroup.json [03:17:10] (PS1) Andrew Bogott: trove: increase guest agent timeout [puppet] - https://gerrit.wikimedia.org/r/952955 (https://phabricator.wikimedia.org/T345004) [03:24:20] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P51800 and previous config saved to /var/cache/conftool/dbconfig/20230829-033002-ladsgroup.json [03:36:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:45:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T343718)', diff saved to https://phabricator.wikimedia.org/P51801 and previous config saved to /var/cache/conftool/dbconfig/20230829-034509-ladsgroup.json [03:45:11] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [03:45:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [03:45:15] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [03:45:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [03:45:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [03:45:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51802 and previous config saved to /var/cache/conftool/dbconfig/20230829-034530-ladsgroup.json [03:45:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51803 and previous config saved to /var/cache/conftool/dbconfig/20230829-034540-ladsgroup.json [03:46:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [03:46:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2025'] [03:51:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:56:38] (CR) Andrew Bogott: [C: +2] trove: increase guest agent timeout [puppet] - https://gerrit.wikimedia.org/r/952955 (https://phabricator.wikimedia.org/T345004) (owner: Andrew Bogott) [04:26:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull 
[04:40:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [04:40:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [04:40:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:40:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [04:40:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:40:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T343718)', diff saved to https://phabricator.wikimedia.org/P51804 and previous config saved to /var/cache/conftool/dbconfig/20230829-044049-ladsgroup.json [04:40:55] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:40:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [04:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:51:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:56:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:58:38] (KubernetesAPILatency) firing: High Kubernetes API 
latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:58:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T343718)', diff saved to https://phabricator.wikimedia.org/P51805 and previous config saved to /var/cache/conftool/dbconfig/20230829-045847-ladsgroup.json [04:58:53] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:01:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:01:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [05:01:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [05:03:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:13:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P51806 and previous config saved to /var/cache/conftool/dbconfig/20230829-051353-ladsgroup.json [05:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:21:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:22:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [05:22:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [05:22:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T343718)', diff saved to https://phabricator.wikimedia.org/P51807 and previous config saved to /var/cache/conftool/dbconfig/20230829-052222-ladsgroup.json [05:22:30] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:26:18] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1117.eqiad.wmnet with OS bullseye [05:26:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:27:29] (PS2) Amire80: Add lucaswerkmeister.de to Planet [puppet] - https://gerrit.wikimedia.org/r/948203 [05:28:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:29:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P51808 and previous config saved to /var/cache/conftool/dbconfig/20230829-052859-ladsgroup.json [05:31:22] (PS2) KartikMistry: Enable Content and Section translation in Ligurian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/952846 (https://phabricator.wikimedia.org/T337669) [05:33:38] (KubernetesAPILatency) resolved: High Kubernetes 
API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:36:08] (CR) Marostegui: [C: +2] mariadb: Comments to clarify db1118 situation [puppet] - https://gerrit.wikimedia.org/r/952885 (owner: Marostegui) [05:38:56] (PS1) Marostegui: db1155: Upgrade to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/953133 (https://phabricator.wikimedia.org/T334650) [05:39:28] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1117.eqiad.wmnet with reason: host reimage [05:39:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T343718)', diff saved to https://phabricator.wikimedia.org/P51809 and previous config saved to /var/cache/conftool/dbconfig/20230829-053953-ladsgroup.json [05:39:58] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:40:19] (CR) Marostegui: [C: +2] db1155: Upgrade to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/953133 (https://phabricator.wikimedia.org/T334650) (owner: Marostegui) [05:42:37] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1117.eqiad.wmnet with reason: host reimage [05:44:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T343718)', diff saved to https://phabricator.wikimedia.org/P51810 and previous config saved to /var/cache/conftool/dbconfig/20230829-054405-ladsgroup.json [05:44:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:44:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:46:29] (RedisMemoryFull) firing: (4) 
Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:51:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:54:30] (CR) Elukey: [C: +1] Increase the kafka-jumbo maximum message size to 10 MB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: Btullis) [05:54:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P51811 and previous config saved to /var/cache/conftool/dbconfig/20230829-055459-ladsgroup.json [05:56:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:58:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:59:38] (PS1) Marostegui: install_server: Do not reimage db2192 [puppet] - https://gerrit.wikimedia.org/r/953134 [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T0600) [06:00:04] kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T0600). 
[06:00:08] (CR) Marostegui: [C: +2] install_server: Do not reimage db2192 [puppet] - https://gerrit.wikimedia.org/r/953134 (owner: Marostegui) [06:00:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [06:00:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [06:00:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T343718)', diff saved to https://phabricator.wikimedia.org/P51812 and previous config saved to /var/cache/conftool/dbconfig/20230829-060047-ladsgroup.json [06:01:10] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:03:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:04:48] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1117.eqiad.wmnet with OS bullseye [06:05:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51813 and previous config saved to /var/cache/conftool/dbconfig/20230829-060519-ladsgroup.json [06:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:09:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T343718)', diff saved to 
https://phabricator.wikimedia.org/P51814 and previous config saved to /var/cache/conftool/dbconfig/20230829-060956-ladsgroup.json [06:10:04] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:10:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P51815 and previous config saved to /var/cache/conftool/dbconfig/20230829-061005-ladsgroup.json [06:13:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:17:12] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1118.eqiad.wmnet with OS bullseye [06:17:31] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1119.eqiad.wmnet with OS bullseye [06:18:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:19:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T343718)', diff saved to https://phabricator.wikimedia.org/P51816 and previous config saved to /var/cache/conftool/dbconfig/20230829-061904-ladsgroup.json [06:19:09] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:20:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P51817 and previous config saved to /var/cache/conftool/dbconfig/20230829-062025-ladsgroup.json [06:25:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P51818 and previous config saved to /var/cache/conftool/dbconfig/20230829-062502-ladsgroup.json [06:25:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T343718)', diff saved to https://phabricator.wikimedia.org/P51819 and previous config saved to /var/cache/conftool/dbconfig/20230829-062511-ladsgroup.json [06:25:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [06:25:17] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:25:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [06:25:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T343718)', diff saved to https://phabricator.wikimedia.org/P51820 and previous config saved to /var/cache/conftool/dbconfig/20230829-062532-ladsgroup.json [06:28:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:28:55] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:30:53] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1118.eqiad.wmnet with reason: host reimage [06:31:08] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1119.eqiad.wmnet with reason: host reimage [06:33:38] (KubernetesAPILatency) resolved: High Kubernetes API 
latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:33:56] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1118.eqiad.wmnet with reason: host reimage [06:34:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P51821 and previous config saved to /var/cache/conftool/dbconfig/20230829-063410-ladsgroup.json [06:35:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P51822 and previous config saved to /var/cache/conftool/dbconfig/20230829-063531-ladsgroup.json [06:36:10] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1119.eqiad.wmnet with reason: host reimage [06:40:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P51823 and previous config saved to /var/cache/conftool/dbconfig/20230829-064009-ladsgroup.json [06:43:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T343718)', diff saved to https://phabricator.wikimedia.org/P51824 and previous config saved to /var/cache/conftool/dbconfig/20230829-064313-ladsgroup.json [06:43:19] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:49:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P51825 and previous config saved to /var/cache/conftool/dbconfig/20230829-064916-ladsgroup.json [06:50:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T343718)', 
diff saved to https://phabricator.wikimedia.org/P51826 and previous config saved to /var/cache/conftool/dbconfig/20230829-065038-ladsgroup.json [06:50:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [06:50:43] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:50:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [06:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51827 and previous config saved to /var/cache/conftool/dbconfig/20230829-065059-ladsgroup.json [06:54:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [06:55:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51828 and previous config saved to /var/cache/conftool/dbconfig/20230829-065515-ladsgroup.json [06:55:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [06:55:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [06:55:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T343718)', diff saved to https://phabricator.wikimedia.org/P51829 and previous config saved to /var/cache/conftool/dbconfig/20230829-065525-ladsgroup.json [06:57:14] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1118.eqiad.wmnet with OS bullseye [06:57:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [06:58:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2154', diff saved to https://phabricator.wikimedia.org/P51830 and previous config saved to /var/cache/conftool/dbconfig/20230829-065819-ladsgroup.json [06:58:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:58:59] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1119.eqiad.wmnet with OS bullseye [07:00:05] Amir1, Urbanecm, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:09] * kart_ is here [07:01:51] I will self-deploy. Small config change only. 
[07:02:35] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/952846 (https://phabricator.wikimedia.org/T337669) (owner: KartikMistry) [07:03:15] (Merged) jenkins-bot: Enable Content and Section translation in Ligurian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/952846 (https://phabricator.wikimedia.org/T337669) (owner: KartikMistry) [07:03:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:03:43] !log kartik@deploy1002 Started scap: Backport for [[gerrit:952846|Enable Content and Section translation in Ligurian Wikipedia (T337669)]] [07:03:48] T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669 [07:04:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [07:04:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T343718)', diff saved to https://phabricator.wikimedia.org/P51831 and previous config saved to /var/cache/conftool/dbconfig/20230829-070422-ladsgroup.json [07:04:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [07:04:32] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:04:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [07:04:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
[07:04:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T343718)', diff saved to https://phabricator.wikimedia.org/P51832 and previous config saved to /var/cache/conftool/dbconfig/20230829-070443-ladsgroup.json [07:05:38] (CR) Muehlenhoff: Add some ferm->nft migration steps to the firewall class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952862 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [07:07:38] !log kartik@deploy1002 kartik: Backport for [[gerrit:952846|Enable Content and Section translation in Ligurian Wikipedia (T337669)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:07:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [07:11:52] (CR) Muehlenhoff: [C: +2] java: Remove now obsolete warning [puppet] - https://gerrit.wikimedia.org/r/952879 (owner: Muehlenhoff) [07:12:08] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1120.eqiad.wmnet with OS bullseye [07:12:25] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1121.eqiad.wmnet with OS bullseye [07:13:14] (CR) Muehlenhoff: [C: +2] Remove comment references to EOLed distros [puppet] - https://gerrit.wikimedia.org/r/952877 (owner: Muehlenhoff) [07:13:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P51833 and previous config saved to /var/cache/conftool/dbconfig/20230829-071326-ladsgroup.json [07:13:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:14:07] (CR) Muehlenhoff: [C: +2] SSH cloud access: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/952880 (owner: Muehlenhoff) [07:14:39] !log kartik@deploy1002 kartik: Continuing with sync [07:15:55] (CR) Muehlenhoff: [C: +1] "Looks good" [cookbooks] - https://gerrit.wikimedia.org/r/952897 (owner: Volans) [07:17:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [07:18:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:22:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:22:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T343718)', diff saved to https://phabricator.wikimedia.org/P51834 and previous config saved to /var/cache/conftool/dbconfig/20230829-072249-ladsgroup.json [07:22:55] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:23:02] (PS1) Stevemunene: idp_test: change datahub_staging profile_format [puppet] - https://gerrit.wikimedia.org/r/953192 (https://phabricator.wikimedia.org/T305874) [07:23:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [07:23:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [07:24:46] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:952846|Enable Content and
Section translation in Ligurian Wikipedia (T337669)]] (duration: 21m 02s) [07:24:51] T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669 [07:25:06] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/953192 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene) [07:25:11] I'm done with backport. [07:25:51] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1120.eqiad.wmnet with reason: host reimage [07:26:00] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1121.eqiad.wmnet with reason: host reimage [07:26:55] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [07:27:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [07:27:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:28:24] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1120.eqiad.wmnet with reason: host reimage [07:28:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T343718)', diff saved to https://phabricator.wikimedia.org/P51835 and previous config saved to /var/cache/conftool/dbconfig/20230829-072832-ladsgroup.json [07:28:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [07:28:37] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:28:38] (KubernetesAPILatency)
firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:28:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [07:28:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [07:28:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T343718)', diff saved to https://phabricator.wikimedia.org/P51836 and previous config saved to /var/cache/conftool/dbconfig/20230829-072853-ladsgroup.json [07:31:05] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1121.eqiad.wmnet with reason: host reimage [07:31:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [07:31:30] (CR) Stevemunene: [C: +2] idp_test: change datahub_staging profile_format [puppet] - https://gerrit.wikimedia.org/r/953192 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene) [07:32:49] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:36:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:37:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [07:37:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining
ganeti node ganeti2029.codfw.wmnet [07:37:49] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P51837 and previous config saved to /var/cache/conftool/dbconfig/20230829-073755-ladsgroup.json [07:39:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [07:41:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet [07:41:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [07:43:19] (CR) David Caro: wmf_sink: catch ssl errors when talking to the proxy server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952919 (https://phabricator.wikimedia.org/T345103) (owner: Andrew Bogott) [07:43:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:44:16] (CR) Muehlenhoff: "A few additional comments inline, looks good otherwise."
[software/bitu] - https://gerrit.wikimedia.org/r/952402 (owner: Slyngshede) [07:45:29] (CR) Volans: [C: +2] sre.hosts: remove stretch from list of OSes [cookbooks] - https://gerrit.wikimedia.org/r/952897 (owner: Volans) [07:46:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T343718)', diff saved to https://phabricator.wikimedia.org/P51838 and previous config saved to /var/cache/conftool/dbconfig/20230829-074643-ladsgroup.json [07:46:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet [07:46:51] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:48:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [07:48:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [07:48:21] (Merged) jenkins-bot: sre.hosts: remove stretch from list of OSes [cookbooks] - https://gerrit.wikimedia.org/r/952897 (owner: Volans) [07:48:33] I'm going to reboot contint2002 in 15 minutes, integration jenkins will be down for some minutes [07:48:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:49:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [07:50:20] SRE, Research, The-Wikipedia-Library, Traffic, and 4 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276 (Xover) >>! In T87276#8784283, @Xover wrote: > This is still spewing errors in the console for every page loa...
[07:51:32] (PS2) Jcrespo: mariadb: Upgrade db1225 to mariadb 10.6 (and generate 10.6 backups) [puppet] - https://gerrit.wikimedia.org/r/942652 (https://phabricator.wikimedia.org/T334650) [07:51:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:51:40] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1120.eqiad.wmnet with OS bullseye [07:53:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P51839 and previous config saved to /var/cache/conftool/dbconfig/20230829-075301-ladsgroup.json [07:54:17] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1121.eqiad.wmnet with OS bullseye [07:54:44] (PS1) EoghanGaffney: gitlab: Remove swift configs and return gitlab1003 to restore group [puppet] - https://gerrit.wikimedia.org/r/953193 [07:56:32] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43035/console" [puppet] - https://gerrit.wikimedia.org/r/953193 (owner: EoghanGaffney) [07:56:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:56:42] (PS5) Slyngshede: Allow Unix shell account to be specified. [software/bitu] - https://gerrit.wikimedia.org/r/952402 [07:56:50] (CR) Slyngshede: Allow Unix shell account to be specified.
(4 comments) [software/bitu] - https://gerrit.wikimedia.org/r/952402 (owner: Slyngshede) [07:59:11] (PS1) Filippo Giunchedi: grafana: re-create user when syncing LDAP [puppet] - https://gerrit.wikimedia.org/r/953195 (https://phabricator.wikimedia.org/T341574) [08:01:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P51840 and previous config saved to /var/cache/conftool/dbconfig/20230829-080149-ladsgroup.json [08:01:50] hashar: I'm going to reboot contint2002 now [08:02:01] Zuul CI is processing a few changes but they should be saved and restored after reboot [08:02:15] worst case I will recheck them (they are all for ContentTranslation mediawiki extension) [08:02:19] so +1 :) [08:02:24] jouncebot: nowandnext [08:02:25] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [08:02:25] In 1 hour(s) and 57 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1000) [08:02:43] ok thanks, then I'll proceed [08:03:10] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host contint2002.wikimedia.org [08:05:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [08:05:17] SRE, Research, The-Wikipedia-Library, Traffic, and 4 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276 (TheDJ) Removal is already in progress via T338183 [08:06:35] host is back already, but cookbook still running checks [08:07:24] yeah I think the cookbook does some kind of busy loop with a long timeout [08:07:44] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling
after maintenance db1177 (T343718)', diff saved to https://phabricator.wikimedia.org/P51841 and previous config saved to /var/cache/conftool/dbconfig/20230829-080807-ladsgroup.json [08:08:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [08:08:13] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:08:17] easy :] [08:08:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [08:08:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T343718)', diff saved to https://phabricator.wikimedia.org/P51842 and previous config saved to /var/cache/conftool/dbconfig/20230829-080828-ladsgroup.json [08:09:08] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:09:44] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint2002.wikimedia.org [08:10:11] hashar: done. contentTranslation jobs are also running again? 
[08:10:20] they got lost [08:10:28] I think on a reboot the systemd unit is killed [08:10:32] and loses all jobs [08:10:50] it is not a big deal when that is done during European mornings since there are typically only a few patches flying in [08:11:07] usually for i18n team (such as ContentTranslation) or WMDE (for Wikibase and such) or SRE (for puppet) [08:11:15] ack [08:11:16] which are easy to catch and `recheck` [08:11:27] there are some jobs lost for good, but they are not important [08:11:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [08:11:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet [08:12:12] theoretically we could have the unit gracefully shut down Zuul; it would then wait for all ongoing jobs to complete, which can take 20/25 minutes and further delay the reboot [08:12:23] ok makes sense [08:12:38] so hard dropping the in-flight jobs is the way to go, given we only restart the machine every six months or so [08:13:45] (PS1) Ladsgroup: Take away ppenloglou's production ssh access [puppet] - https://gerrit.wikimedia.org/r/953196 [08:14:50] hashar: but in https://integration.wikimedia.org/zuul/ I can see some contentTranslation jobs again. Or did you restart the jobs?
[08:14:52] (CR) Ladsgroup: [C: +2] Take away ppenloglou's production ssh access [puppet] - https://gerrit.wikimedia.org/r/953196 (owner: Ladsgroup) [08:16:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P51843 and previous config saved to /var/cache/conftool/dbconfig/20230829-081655-ladsgroup.json [08:20:19] SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (Ladsgroup) [08:20:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet [08:23:16] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:26:21] !log downtiming cassandra-a alerts on restbase1030.eqiad.wmnet for 14 days T344210 T344259 [08:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:28] T344259: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 [08:26:29] T344210: restbase1030: Cassandra crashing (signal 11) - https://phabricator.wikimedia.org/T344210 [08:26:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T343718)', diff saved to https://phabricator.wikimedia.org/P51844 and previous config saved to /var/cache/conftool/dbconfig/20230829-082644-ladsgroup.json [08:26:49] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:27:14] jelto: I have restarted one of the changes by commenting `recheck`, the rest I guess is normal activity (developers doing stuff) [08:27:39] ok great thanks. Then we can call it done?
[08:28:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:29:00] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:47] !log Restarted grafana-ldap-users-sync.service on grafana1002 [08:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T343718)', diff saved to https://phabricator.wikimedia.org/P51845 and previous config saved to /var/cache/conftool/dbconfig/20230829-083202-ladsgroup.json [08:32:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [08:32:07] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:32:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [08:32:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T343718)', diff saved to https://phabricator.wikimedia.org/P51846 and previous config saved to /var/cache/conftool/dbconfig/20230829-083223-ladsgroup.json [08:33:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:36:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet [08:36:38] (KubernetesAPILatency) firing: High Kubernetes API latency 
(LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:41:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:41:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P51847 and previous config saved to /var/cache/conftool/dbconfig/20230829-084150-ladsgroup.json [08:42:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet [08:42:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet [08:43:11] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/953195 (https://phabricator.wikimedia.org/T341574) (owner: Filippo Giunchedi) [08:49:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T343718)', diff saved to https://phabricator.wikimedia.org/P51848 and previous config saved to /var/cache/conftool/dbconfig/20230829-084955-ladsgroup.json [08:50:01] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:52:43] (PS1) Clément Goubert: P:mw::deployment::server: Don't alert for train-presync [puppet] - https://gerrit.wikimedia.org/r/953200 [08:55:27] (CR) Majavah: "Bug: T342755 maybe?"
[puppet] - https://gerrit.wikimedia.org/r/953200 (owner: Clément Goubert) [08:56:55] (PS2) Clément Goubert: P:mw::deployment::server: Don't alert for train-presync [puppet] - https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) [08:56:57] (PS1) Muehlenhoff: Stop building stretch images and update monitoring for the docker registry [puppet] - https://gerrit.wikimedia.org/r/953201 [08:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P51849 and previous config saved to /var/cache/conftool/dbconfig/20230829-085656-ladsgroup.json [08:57:11] (CR) Clément Goubert: P:mw::deployment::server: Don't alert for train-presync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: Clément Goubert) [09:00:51] (PS1) Jbond: cumin::master: add missing docs [puppet] - https://gerrit.wikimedia.org/r/953202 (https://phabricator.wikimedia.org/T341497) [09:00:53] (PS1) Jbond: cumin: update cumin host to use the puppetdb-micro service [puppet] - https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) [09:03:17] (PS2) Jbond: cumin: update cumin host to use the puppetdb-micro service [puppet] - https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) [09:03:44] (CR) Muehlenhoff: "Looks good, one typo inline" [puppet] - https://gerrit.wikimedia.org/r/953202 (https://phabricator.wikimedia.org/T341497) (owner: Jbond) [09:04:41] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43038/console" [puppet] - https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: Jbond) [09:05:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to
https://phabricator.wikimedia.org/P51850 and previous config saved to /var/cache/conftool/dbconfig/20230829-090501-ladsgroup.json [09:06:00] (03CR) 10Clément Goubert: [C: 04-1] "There's an error in the Dockerfile, but otherwise LGTM." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952900 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [09:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:06:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:07:57] (03CR) 10Muehlenhoff: "Looks good, one missing thing inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/952402 (owner: 10Slyngshede) [09:08:08] (03CR) 10Clément Goubert: [C: 04-1] "Another thing to keep in mind, even though it is not used in production, is a change to drain-envoy.sh to communicate through that socket " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952900 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [09:09:13] (03CR) 10Clément Goubert: [C: 04-1] envoy: Create /var/run/envoy (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952900 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [09:10:08] (03PS6) 10Slyngshede: Allow Unix shell account to be specified. [software/bitu] - 10https://gerrit.wikimedia.org/r/952402 [09:10:20] (03CR) 10Slyngshede: Allow Unix shell account to be specified. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/952402 (owner: 10Slyngshede) [09:10:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T343718)', diff saved to https://phabricator.wikimedia.org/P51851 and previous config saved to /var/cache/conftool/dbconfig/20230829-091059-ladsgroup.json [09:11:02] (03PS1) 10Muehlenhoff: local_dev: Update image [puppet] - 10https://gerrit.wikimedia.org/r/953205 [09:11:07] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:11:18] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:11:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:11:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:12:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T343718)', diff saved to https://phabricator.wikimedia.org/P51852 and previous config saved to /var/cache/conftool/dbconfig/20230829-091202-ladsgroup.json [09:12:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [09:12:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [09:12:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T343718)', diff saved to https://phabricator.wikimedia.org/P51853 and previous config saved to /var/cache/conftool/dbconfig/20230829-091223-ladsgroup.json 
[09:13:36] (03PS3) 10Jbond: cumin: update cumin host to use the puppetdb-micro service [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) [09:13:44] (03CR) 10Clément Goubert: [C: 03+1] mesh.configuration: Bind the admin interface to a socket instead of tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/952902 (owner: 10JMeybohm) [09:14:42] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:14:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:15:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43039/console" [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [09:16:37] (03CR) 10Jbond: [V: 03+1] "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [09:17:00] (03PS2) 10Jbond: cumin::master: add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/953202 (https://phabricator.wikimedia.org/T341497) [09:17:12] (03CR) 10Jbond: [C: 03+2] "done thanks" [puppet] - 10https://gerrit.wikimedia.org/r/953202 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [09:17:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero) >>! In T342161#9064747, @Andrew wrote: > I'm sure that this only needs a single nic connected unless @aborrero has something truly amb... 
[09:17:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51854 and previous config saved to /var/cache/conftool/dbconfig/20230829-091756-ladsgroup.json [09:18:03] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:18:37] Looks like Init7 has a tummyache [09:19:35] Ah no, wrong read [09:19:41] (03CR) 10Muehlenhoff: [C: 03+1] cumin::master: add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/953202 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [09:19:47] xe-3/2/1: down -> Transport: cr1-esams:xe-0/0/7 (Lumen, BDFS2448 80ms 10Gbps wave) {#2013} [09:20:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P51855 and previous config saved to /var/cache/conftool/dbconfig/20230829-092007-ladsgroup.json [09:20:17] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: re-create user when syncing LDAP [puppet] - 10https://gerrit.wikimedia.org/r/953195 (https://phabricator.wikimedia.org/T341574) (owner: 10Filippo Giunchedi) [09:20:22] (03PS4) 10Jbond: cumin: update cumin host to use the puppetdb-micro service [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) [09:20:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/952402 (owner: 10Slyngshede) [09:22:11] !log restart db1205 [09:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:38] (03PS4) 10Jbond: kernel_report: small script to generate reboots task [puppet] - 10https://gerrit.wikimedia.org/r/952401 [09:22:41] (03CR) 10Jbond: "fixed thanks" [puppet] - 10https://gerrit.wikimedia.org/r/952401 (owner: 10Jbond) [09:22:56] !log failover the ganeti master in codfw to ganeti2022 [09:23:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:21] (03PS15) 10Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) [09:24:02] (03CR) 10Jcrespo: [C: 03+1] "Ready to go" [puppet] - 10https://gerrit.wikimedia.org/r/942652 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [09:24:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/952401 (owner: 10Jbond) [09:25:41] XioNoX: topranks: I don't actually know what to do with that BFD/BGP error, can one of you teach me? :) [09:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P51856 and previous config saved to /var/cache/conftool/dbconfig/20230829-092605-ladsgroup.json [09:26:27] claime: https://wikitech.wikimedia.org/wiki/Network_monitoring :) [09:26:42] PROBLEM - ganeti-wconfd running on ganeti2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [09:28:16] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 33 [09:28:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 33 [09:29:36] ganeti2020 is expected, monitoring glitch from the failover [09:29:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T343718)', diff saved to https://phabricator.wikimedia.org/P51857 and previous config saved to /var/cache/conftool/dbconfig/20230829-092957-ladsgroup.json [09:30:07] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:30:26] claime: transport link down -> check the maintenance list -> open task if unplanned -> use the `sre.network.debug` cookbook to add data -> open ticket with provider [09:30:32] (03PS2) 
10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [09:30:34] (03PS26) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [09:30:38] the last step as it's lumen can only be done by netops [09:30:57] I mean can't be done by email, only phone or portal [09:31:01] (03CR) 10Filippo Giunchedi: "AFAICS in the task the generic systemd alert for unit failed (i.e. any unit) is mentioned, which would trigger anyways even with monitorin" [puppet] - 10https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: 10Clément Goubert) [09:31:20] (03CR) 10JMeybohm: [C: 04-1] "I might be ignorant about this but what is the benefit of monitoring wikifunctions through an mediawiki API call rather then monitoring it" [puppet] - 10https://gerrit.wikimedia.org/r/952486 (owner: 10Jforrester) [09:31:39] (03CR) 10Jbond: [C: 03+2] kernel_report: small script to generate reboots task [puppet] - 10https://gerrit.wikimedia.org/r/952401 (owner: 10Jbond) [09:32:45] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cr2-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [09:32:54] (03PS2) 10JMeybohm: mesh.configuration: Bind the admin interface to a socket instead of tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/952902 (https://phabricator.wikimedia.org/T343709) [09:33:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P51858 and previous config saved to /var/cache/conftool/dbconfig/20230829-093303-ladsgroup.json [09:33:32] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 
10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [09:34:07] XioNoX: Thanks for the info. I saw you ran the network debug cookbook, are you taking it from there, or do you still want me to open a task for y'all? [09:34:14] (03PS27) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [09:34:34] claime: you can open the task [09:35:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T343718)', diff saved to https://phabricator.wikimedia.org/P51859 and previous config saved to /var/cache/conftool/dbconfig/20230829-093513-ladsgroup.json [09:35:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [09:35:20] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:35:29] claime: I created Ticket ID 27488343 with Lumen
[09:35:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [09:35:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:35:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:35:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T343718)', diff saved to https://phabricator.wikimedia.org/P51860 and previous config saved to /var/cache/conftool/dbconfig/20230829-093539-ladsgroup.json [09:36:32] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [09:36:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:37:19] (03CR) 10Slyngshede: [V: 03+2] Allow Unix shell account to be specified. [software/bitu] - 10https://gerrit.wikimedia.org/r/952402 (owner: 10Slyngshede) [09:37:22] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Allow Unix shell account to be specified. [software/bitu] - 10https://gerrit.wikimedia.org/r/952402 (owner: 10Slyngshede) [09:37:45] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [09:38:25] claime: thx!
[09:38:44] XioNoX: I'll run the cookbook just to see what it does [09:38:59] claime: sure, you can add the task ID too [09:39:06] !log cgoubert@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr1-esams:xe-0/0/7 [09:39:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-esams:xe-0/0/7 [09:39:24] (03CR) 10Muehlenhoff: "Looks good, one comment inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/952658 (owner: 10Slyngshede) [09:39:37] -40dB that's not going to work well x) [09:40:12] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be100[34] - https://phabricator.wikimedia.org/T342675 (10MatthewVernon) Yes, I was expecting one node, call it moss-be1003 please? [09:40:16] yeah no light at all [09:40:51] (03PS3) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [09:41:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P51861 and previous config saved to /var/cache/conftool/dbconfig/20230829-094111-ladsgroup.json [09:41:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:42:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43042/console" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [09:43:02] (03PS1) 10Slyngshede: C:idm update signup validators. 
[puppet] - 10https://gerrit.wikimedia.org/r/953227 [09:43:56] (03CR) 10Slyngshede: [C: 03+2] C:idm update signup validators. [puppet] - 10https://gerrit.wikimedia.org/r/953227 (owner: 10Slyngshede) [09:44:29] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1122.eqiad.wmnet with OS bullseye [09:44:47] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1123.eqiad.wmnet with OS bullseye [09:45:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P51862 and previous config saved to /var/cache/conftool/dbconfig/20230829-094503-ladsgroup.json [09:46:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:46:55] (03PS28) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [09:47:42] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43043/console" [puppet] - 10https://gerrit.wikimedia.org/r/951460 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [09:48:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P51864 and previous config saved to /var/cache/conftool/dbconfig/20230829-094809-ladsgroup.json [09:49:14] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [09:49:30] (03PS3) 10Clément Goubert: P:mw::deployment::server: Don't alert for train-presync [puppet] - 10https://gerrit.wikimedia.org/r/953200 [09:49:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 
10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructure - https://phabricator.wikimedia.org/T341497 (10Volans) [09:50:10] (03CR) 10Clément Goubert: P:mw::deployment::server: Don't alert for train-presync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert) [09:51:00] (03PS4) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [09:51:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:51:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/952876 (owner: 10Jbond) [09:52:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43044/console" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [09:53:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero) Is there any blocker for this server to get it ready to go? 
[09:53:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T343718)', diff saved to https://phabricator.wikimedia.org/P51865 and previous config saved to /var/cache/conftool/dbconfig/20230829-095321-ladsgroup.json [09:53:28] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:53:55] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:54:40] (03CR) 10Clément Goubert: P:mw::deployment::server: Don't alert for train-presync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert) [09:56:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T343718)', diff saved to https://phabricator.wikimedia.org/P51866 and previous config saved to /var/cache/conftool/dbconfig/20230829-095617-ladsgroup.json [09:56:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [09:56:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [09:56:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T343718)', diff saved to https://phabricator.wikimedia.org/P51867 and previous config saved to /var/cache/conftool/dbconfig/20230829-095638-ladsgroup.json [09:57:58] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1122.eqiad.wmnet with reason: host reimage [09:58:14] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1123.eqiad.wmnet with reason: host reimage [09:58:43] (03CR) 10Volans: "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/953203 
(https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1000) [10:00:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P51868 and previous config saved to /var/cache/conftool/dbconfig/20230829-100009-ladsgroup.json [10:02:14] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1123.eqiad.wmnet with reason: host reimage [10:03:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51869 and previous config saved to /var/cache/conftool/dbconfig/20230829-100315-ladsgroup.json [10:03:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:03:21] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:03:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:04:00] (03PS29) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [10:04:12] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1122.eqiad.wmnet with reason: host reimage [10:05:23] (03Abandoned) 10Clément Goubert: mediawiki: Generalize tls-proxy limits removal [deployment-charts] - 10https://gerrit.wikimedia.org/r/952171 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:08:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P51870 and previous config saved to 
/var/cache/conftool/dbconfig/20230829-100827-ladsgroup.json [10:09:43] (03PS3) 10JMeybohm: mesh.configuration: Bind the admin interface to a socket instead of tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/952902 (https://phabricator.wikimedia.org/T343709) [10:10:15] (03PS1) 10Jelto: miscweb: update bugzilla image to unavailable css,js [deployment-charts] - 10https://gerrit.wikimedia.org/r/953229 (https://phabricator.wikimedia.org/T343914) [10:10:31] (03CR) 10FNegri: "LGTM but I'm missing some context, is there a wiki or phab where the new puppetdb architecture is described?" [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [10:11:07] (03CR) 10FNegri: cumin: update cumin host to use the puppetdb-micro service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [10:11:33] (03CR) 10JMeybohm: mediawiki: Remove tls-proxy CPU limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/952867 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:13:06] (03CR) 10Jelto: [C: 03+2] miscweb: update bugzilla image to unavailable css,js [deployment-charts] - 10https://gerrit.wikimedia.org/r/953229 (https://phabricator.wikimedia.org/T343914) (owner: 10Jelto) [10:13:46] (03CR) 10JMeybohm: "I had to change the admin cluster type to static" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952902 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [10:13:54] (03Merged) 10jenkins-bot: miscweb: update bugzilla image to unavailable css,js [deployment-charts] - 10https://gerrit.wikimedia.org/r/953229 (https://phabricator.wikimedia.org/T343914) (owner: 10Jelto) [10:14:19] (03PS1) 10Clément Goubert: kubernetes: Bump envoy image version to 1.23.10-2-s1 [puppet] - 10https://gerrit.wikimedia.org/r/953230 (https://phabricator.wikimedia.org/T344814) [10:14:49] (03PS5) 10Jbond: ferm: add ensure 
support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [10:15:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T343718)', diff saved to https://phabricator.wikimedia.org/P51871 and previous config saved to /var/cache/conftool/dbconfig/20230829-101515-ladsgroup.json [10:15:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [10:15:22] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:15:29] (03PS6) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [10:15:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [10:15:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T343718)', diff saved to https://phabricator.wikimedia.org/P51872 and previous config saved to /var/cache/conftool/dbconfig/20230829-101536-ladsgroup.json [10:15:43] (03PS2) 10Clément Goubert: mediawiki: Remove tls-proxy CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/952867 (https://phabricator.wikimedia.org/T344814) [10:15:54] (03PS3) 10JMeybohm: envoy: Create /var/run/envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952900 (https://phabricator.wikimedia.org/T343709) [10:15:56] (03CR) 10CI reject: [V: 04-1] ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:16:05] (03PS3) 10Clément Goubert: mediawiki: Remove tls-proxy CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/952867 (https://phabricator.wikimedia.org/T344814) [10:16:17] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply 
[10:16:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43047/console" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:16:26] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43046/console" [puppet] - 10https://gerrit.wikimedia.org/r/953230 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:16:39] (03PS4) 10Clément Goubert: mediawiki: Remove tls-proxy CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/952867 (https://phabricator.wikimedia.org/T344814) [10:16:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43048/console" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:17:19] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [10:17:21] (03CR) 10Clément Goubert: mediawiki: Remove tls-proxy CPU limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/952867 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:17:43] (03CR) 10JMeybohm: envoy: Create /var/run/envoy (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952900 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [10:17:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [10:18:25] (03PS7) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [10:19:04] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952900 
(https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [10:19:08] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [10:19:08] (03PS5) 10Jbond: cumin: update cumin host to use the puppetdb-micro service [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) [10:19:23] (03PS6) 10Jbond: cumin: update cumin host to use the puppetdb-micro service [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) [10:19:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43049/console" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:19:49] (03PS2) 10Clément Goubert: kubernetes: Bump envoy image version to 1.23.10-2-s1 [puppet] - 10https://gerrit.wikimedia.org/r/953230 (https://phabricator.wikimedia.org/T344814) [10:20:03] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy: Create /var/run/envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952900 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [10:20:24] (03PS3) 10Clément Goubert: kubernetes: Bump envoy image version to 1.23.10-2-s2 [puppet] - 10https://gerrit.wikimedia.org/r/953230 (https://phabricator.wikimedia.org/T344814) [10:21:07] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [10:21:27] (03PS4) 10Clément Goubert: kubernetes: Bump envoy image version to 1.23.10-2-s2 [puppet] - 10https://gerrit.wikimedia.org/r/953230 (https://phabricator.wikimedia.org/T344814) [10:22:29] !log Successfully published image docker-registry.discovery.wmnet/envoy:1.23.10-2-s2 [10:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:43] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [10:23:17] (03CR) 10Jbond: cumin: update cumin host to use the 
puppetdb-micro service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [10:23:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P51873 and previous config saved to /var/cache/conftool/dbconfig/20230829-102333-ladsgroup.json [10:23:45] (03CR) 10JMeybohm: [C: 03+1] "Successfully published image docker-registry.discovery.wmnet/envoy:1.23.10-2-s2" [puppet] - 10https://gerrit.wikimedia.org/r/953230 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:24:24] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [10:24:43] (03CR) 10JMeybohm: [C: 03+1] mediawiki: Remove tls-proxy CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/952867 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:24:52] (03CR) 10Jbond: [V: 03+1] ferm: add ensure support to the ferm class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:25:01] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:25:19] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1123.eqiad.wmnet with OS bullseye [10:27:19] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1122.eqiad.wmnet with OS bullseye [10:27:31] !log reboot db1204 [10:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:15] (03CR) 10Clément Goubert: [C: 03+2] kubernetes: Bump envoy image version to 1.23.10-2-s2 [puppet] - 10https://gerrit.wikimedia.org/r/953230 
(https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:30:03] !log Running puppet on deploy servers to bump envoy image version - T344814 [10:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:11] T344814: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 [10:31:42] (03CR) 10Clément Goubert: [C: 03+1] mesh.configuration: Add new minor version 1.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/952901 (owner: 10JMeybohm) [10:31:57] (03CR) 10Clément Goubert: [C: 03+1] mesh.configuration: Bind the admin interface to a socket instead of tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/952902 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [10:31:57] !restart db2183 [10:32:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet [10:34:02] PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100% [10:34:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T343718)', diff saved to https://phabricator.wikimedia.org/P51874 and previous config saved to /var/cache/conftool/dbconfig/20230829-103409-ladsgroup.json [10:34:15] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:34:40] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:36:10] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Remove tls-proxy CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/952867 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:37:01] (03Merged) 10jenkins-bot: mediawiki: Remove tls-proxy CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/952867 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:37:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952325 
(https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [10:38:17] (03CR) 10Jbond: [C: 03+2] interfaces: updated to use f-strings [puppet] - 10https://gerrit.wikimedia.org/r/952876 (owner: 10Jbond) [10:38:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T343718)', diff saved to https://phabricator.wikimedia.org/P51875 and previous config saved to /var/cache/conftool/dbconfig/20230829-103840-ladsgroup.json [10:38:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [10:38:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [10:38:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet [10:39:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T343718)', diff saved to https://phabricator.wikimedia.org/P51876 and previous config saved to /var/cache/conftool/dbconfig/20230829-103901-ladsgroup.json [10:39:02] jouncebot: nowandnext [10:39:02] For the next 0 hour(s) and 20 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1000) [10:39:02] In 1 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1200) [10:39:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [10:39:18] PROBLEM - Check systemd state on dbstore1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@staging.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:31] !log cgoubert@deploy1002 Started scap: Removing mw-on-k8s tls-proxy CPU limits - T344814 [10:39:37] T344814: mw-on-k8s tls-proxy container CPU 
throttling at low average load - https://phabricator.wikimedia.org/T344814 [10:40:28] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms [10:40:34] RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 31.90 ms [10:41:59] !log cgoubert@deploy1002 Finished scap: Removing mw-on-k8s tls-proxy CPU limits - T344814 (duration: 02m 27s) [10:42:44] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host puppetserver1002.eqiad.wmnet with OS bookworm [10:46:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:00] (03CR) 10Hnowlan: [C: 03+1] mesh.configuration: Bind the admin interface to a socket instead of tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/952902 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [10:47:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1009.eqiad.wmnet [10:49:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P51877 and previous config saved to /var/cache/conftool/dbconfig/20230829-104915-ladsgroup.json [10:51:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:51:36] !log joal@deploy1002 Started deploy [airflow-dags/analytics@90f280e]: Regular deploy of Analytics airflow dags [airflow-dags/analytics@90f280ec] [10:51:38] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr3-ulsfo [10:51:51] !log joal@deploy1002 Finished deploy 
[airflow-dags/analytics@90f280e]: Regular deploy of Analytics airflow dags [airflow-dags/analytics@90f280ec] (duration: 00m 14s) [10:52:48] (03CR) 10Muehlenhoff: ferm: add ensure support to the ferm class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:53:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet [10:55:00] PROBLEM - Check systemd state on debmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-ulsfo [10:57:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T343718)', diff saved to https://phabricator.wikimedia.org/P51878 and previous config saved to /var/cache/conftool/dbconfig/20230829-105746-ladsgroup.json [10:57:53] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:58:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet [10:59:02] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver1002.eqiad.wmnet with reason: host reimage [10:59:20] (03PS3) 10Slyngshede: Email on successful signup. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/952658 [10:59:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1009.eqiad.wmnet [10:59:38] (03CR) 10Muehlenhoff: ferm: add ensure support to the ferm class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:59:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1010.eqiad.wmnet [11:02:15] (03PS1) 10Cparle: Enable temp accounts on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953232 [11:02:32] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver1002.eqiad.wmnet with reason: host reimage [11:03:08] (03PS2) 10Cparle: Enable temp accounts on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953232 (https://phabricator.wikimedia.org/T324067) [11:04:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P51879 and previous config saved to /var/cache/conftool/dbconfig/20230829-110421-ladsgroup.json [11:04:24] (03PS3) 10Cparle: Enable temp accounts on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953232 (https://phabricator.wikimedia.org/T342067) [11:08:34] !log installing nftables bugfix updates from Bullseye point release [11:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:53] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [11:11:32] (03CR) 10Majavah: [C: 04-1] Stop building stretch images and update monitoring for the docker registry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953201 (owner: 10Muehlenhoff) [11:11:32] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - 
https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [11:12:26] (03PS2) 10Muehlenhoff: Stop building stretch images and update monitoring for the docker registry [puppet] - 10https://gerrit.wikimedia.org/r/953201 [11:12:29] (03CR) 10Muehlenhoff: Stop building stretch images and update monitoring for the docker registry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953201 (owner: 10Muehlenhoff) [11:12:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P51880 and previous config saved to /var/cache/conftool/dbconfig/20230829-111252-ladsgroup.json [11:13:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1010.eqiad.wmnet [11:13:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1010.eqiad.wmnet [11:14:37] RECOVERY - Check systemd state on debmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1010.eqiad.wmnet [11:19:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T343718)', diff saved to https://phabricator.wikimedia.org/P51881 and previous config saved to /var/cache/conftool/dbconfig/20230829-111927-ladsgroup.json [11:19:28] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin1001" [11:19:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [11:19:34] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:19:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet 
with reason: Maintenance [11:19:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T343718)', diff saved to https://phabricator.wikimedia.org/P51882 and previous config saved to /var/cache/conftool/dbconfig/20230829-111949-ladsgroup.json [11:20:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin1001" [11:20:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver1002.eqiad.wmnet with OS bookworm [11:21:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1010.eqiad.wmnet [11:21:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1010.eqiad.wmnet [11:24:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1011.eqiad.wmnet [11:24:31] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10jbond) 05Open→03In progress [11:24:40] (03PS1) 10Jbond: site.pp: move puppetserver1002 to puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/953234 (https://phabricator.wikimedia.org/T345067) [11:24:41] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [11:27:15] (03CR) 10Jbond: [C: 03+2] site.pp: move puppetserver1002 to puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/953234 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [11:27:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P51883 and previous config saved to /var/cache/conftool/dbconfig/20230829-112758-ladsgroup.json 
[11:28:19] (03CR) 10Clément Goubert: "Looping in releng, because there are some references to stretch images in integration/config:" [puppet] - 10https://gerrit.wikimedia.org/r/953201 (owner: 10Muehlenhoff) [11:29:59] (03PS1) 10Urbanecm: beta: Add ORES channel to logged channels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953235 [11:30:03] jouncebot: nowandnext [11:30:03] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [11:30:04] In 0 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1200) [11:30:14] (03CR) 10Urbanecm: [C: 03+2] "beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953235 (owner: 10Urbanecm) [11:30:27] (03PS4) 10Urbanecm: Enable temp accounts on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953232 (https://phabricator.wikimedia.org/T342067) (owner: 10Cparle) [11:30:47] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953232 (https://phabricator.wikimedia.org/T342067) (owner: 10Cparle) [11:30:58] (03Merged) 10jenkins-bot: beta: Add ORES channel to logged channels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953235 (owner: 10Urbanecm) [11:31:26] (03PS1) 10Effie Mouzeli: thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/953236 (https://phabricator.wikimedia.org/T343987) [11:33:55] (03PS1) 10Muehlenhoff: Revert vendor note for concat [puppet] - 10https://gerrit.wikimedia.org/r/953237 [11:35:52] (03PS1) 10Muehlenhoff: Revert "confd: Make confd_prometheus_metrics.py 3.4-compatible" [puppet] - 10https://gerrit.wikimedia.org/r/953238 [11:37:36] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[11:38:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T343718)', diff saved to https://phabricator.wikimedia.org/P51884 and previous config saved to /var/cache/conftool/dbconfig/20230829-113823-ladsgroup.json [11:38:29] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:38:36] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:56] eqsin not feeling good? [11:40:53] claime: there's a jump in load on cache_text [11:41:04] claime: see private [11:41:10] volans: yep [11:42:36] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T343718)', diff saved to https://phabricator.wikimedia.org/P51885 and previous config saved to /var/cache/conftool/dbconfig/20230829-114304-ladsgroup.json [11:43:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [11:43:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [11:43:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T343718)', diff saved to https://phabricator.wikimedia.org/P51886 and previous config saved to /var/cache/conftool/dbconfig/20230829-114326-ladsgroup.json [11:43:31] T343718: Drop old columns of externallinks - 
https://phabricator.wikimedia.org/T343718 [11:43:36] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:45:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet [11:47:52] (03CR) 10Hashar: [C: 03+1] "I guess it makes sense given Debian Stretch stopped receiving updates a while ago." [puppet] - 10https://gerrit.wikimedia.org/r/953201 (owner: 10Muehlenhoff) [11:48:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM with minor comments inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [11:48:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:51:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet [11:51:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1011.eqiad.wmnet [11:53:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P51887 and previous config saved to /var/cache/conftool/dbconfig/20230829-115329-ladsgroup.json [11:56:03] (ProbeDown) resolved: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:56:24] (03PS5) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161) [12:00:07] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1200) [12:01:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1012.eqiad.wmnet [12:01:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:02:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T343718)', diff saved to https://phabricator.wikimedia.org/P51888 and previous config saved to /var/cache/conftool/dbconfig/20230829-120221-ladsgroup.json [12:02:27] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:02:38] (03PS4) 10Slyngshede: Email on successful signup. [software/bitu] - 10https://gerrit.wikimedia.org/r/952658 [12:02:43] (03CR) 10Slyngshede: Email on successful signup. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/952658 (owner: 10Slyngshede) [12:03:30] (03CR) 10Jcrespo: [C: 03+1] "No blocker on my side, I don't even remember doing the original patch :-D ! (but I don't handle services with huge dependency on it, haven" [puppet] - 10https://gerrit.wikimedia.org/r/953238 (owner: 10Muehlenhoff) [12:04:21] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "confd: Make confd_prometheus_metrics.py 3.4-compatible" [puppet] - 10https://gerrit.wikimedia.org/r/953238 (owner: 10Muehlenhoff) [12:05:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [12:05:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [12:06:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2140 (T343718)', diff saved to https://phabricator.wikimedia.org/P51889 and previous config saved to /var/cache/conftool/dbconfig/20230829-120603-ladsgroup.json [12:06:54] (03CR) 10Jcrespo: [C: 03+1] "Adding Clément as a heads up, rather than a compulsory reviewer." 
[puppet] - 10https://gerrit.wikimedia.org/r/953238 (owner: 10Muehlenhoff) [12:07:58] (03CR) 10EoghanGaffney: [C: 03+1] Revert "trafficserver: switch all miscweb services to codfw cname" [puppet] - 10https://gerrit.wikimedia.org/r/950825 (owner: 10Jelto) [12:08:33] (03CR) 10EoghanGaffney: [C: 03+1] gerrit : Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952457 (owner: 10Muehlenhoff) [12:08:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P51890 and previous config saved to /var/cache/conftool/dbconfig/20230829-120835-ladsgroup.json [12:08:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/952658 (owner: 10Slyngshede) [12:09:04] (03CR) 10EoghanGaffney: [C: 03+1] aphlict : Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952461 (owner: 10Muehlenhoff) [12:11:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:11:30] (03CR) 10Slyngshede: [V: 03+2] Email on successful signup. [software/bitu] - 10https://gerrit.wikimedia.org/r/952658 (owner: 10Slyngshede) [12:11:32] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Email on successful signup. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/952658 (owner: 10Slyngshede) [12:11:50] (03PS1) 10Arturo Borrero Gonzalez: nftables: fix port statement generation [puppet] - 10https://gerrit.wikimedia.org/r/953246 [12:11:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet [12:13:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T343718)', diff saved to https://phabricator.wikimedia.org/P51891 and previous config saved to /var/cache/conftool/dbconfig/20230829-121305-ladsgroup.json [12:13:14] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:13:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:14:13] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:14:17] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:14:17] PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [12:14:20] (03PS2) 10Arturo Borrero Gonzalez: nftables: fix port statement generation [puppet] - 10https://gerrit.wikimedia.org/r/953246 [12:14:21] PROBLEM - Host kubestagetcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [12:15:08] (03CR) 10CI reject: [V: 04-1] nftables: fix port statement generation [puppet] - 10https://gerrit.wikimedia.org/r/953246 (owner: 10Arturo Borrero Gonzalez) [12:15:27] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [12:15:35] RECOVERY - Host kubestagetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 3.03 ms [12:15:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Ladsgroup) Waiting for approval from data 
engineering manager as well. [12:15:43] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [12:16:35] RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms [12:17:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P51892 and previous config saved to /var/cache/conftool/dbconfig/20230829-121727-ladsgroup.json [12:18:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet [12:18:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1012.eqiad.wmnet [12:18:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:20:40] (03CR) 10Ladsgroup: "Given that it's mariadb-backup package, I'm adding Jaime. Sorry if I missed something obvious." [puppet] - 10https://gerrit.wikimedia.org/r/952881 (owner: 10Muehlenhoff) [12:21:17] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10fgiunchedi) >>! In T329272#9122382, @ayounsi wrote: > Looking at the `parents` field. > So far we've been defining them man... [12:22:18] (03CR) 10Filippo Giunchedi: [C: 03+2] jaeger: use . 
as date separator for storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/952783 (https://phabricator.wikimedia.org/T344954) (owner: 10Filippo Giunchedi) [12:23:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1013.eqiad.wmnet [12:23:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T343718)', diff saved to https://phabricator.wikimedia.org/P51893 and previous config saved to /var/cache/conftool/dbconfig/20230829-122342-ladsgroup.json [12:23:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [12:23:48] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:23:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [12:24:00] 10SRE, 10Infrastructure-Foundations, 10netops: xe-3/2/1: down -> Transport: cr1-esams:xe-0/0/7 (Lumen, BDFS2448 80ms 10Gbps wave) {#2013} - https://phabricator.wikimedia.org/T345138 (10ayounsi) > Currently your circuit is being affected by a higher level outage. I will continue to provide updates as I recei... 
[12:24:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1209 (T343718)', diff saved to https://phabricator.wikimedia.org/P51894 and previous config saved to /var/cache/conftool/dbconfig/20230829-122403-ladsgroup.json [12:24:51] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [12:26:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:27:42] (03PS3) 10Arturo Borrero Gonzalez: nftables: fix port statement generation [puppet] - 10https://gerrit.wikimedia.org/r/953246 [12:27:55] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [12:28:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P51896 and previous config saved to /var/cache/conftool/dbconfig/20230829-122811-ladsgroup.json [12:30:56] (03PS1) 10Slyngshede: C:idm send notification to Moritz and myself [puppet] - 10https://gerrit.wikimedia.org/r/953249 [12:31:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:31:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/953246 (owner: 10Arturo Borrero Gonzalez) [12:31:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: fix port statement generation [puppet] - 10https://gerrit.wikimedia.org/r/953246 (owner: 10Arturo Borrero Gonzalez) [12:32:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P51897 and previous config saved to /var/cache/conftool/dbconfig/20230829-123233-ladsgroup.json [12:33:01] (03PS2) 10Slyngshede: C:idm send notification to Moritz and myself [puppet] - 10https://gerrit.wikimedia.org/r/953249 [12:33:44] (03CR) 10Slyngshede: [C: 03+2] C:idm send notification to Moritz and myself [puppet] - 10https://gerrit.wikimedia.org/r/953249 (owner: 10Slyngshede) [12:34:01] (03CR) 10Slyngshede: C:idm send notification to Moritz and myself [puppet] - 10https://gerrit.wikimedia.org/r/953249 (owner: 10Slyngshede) [12:35:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cluster::cloud_management allow access to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) (owner: 10FNegri) [12:36:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:37:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1013.eqiad.wmnet [12:37:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:38:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/953249 (owner: 10Slyngshede) [12:38:34] (03CR) 10Slyngshede: [C: 03+2] C:idm send notification to Moritz and myself [puppet] - 
10https://gerrit.wikimedia.org/r/953249 (owner: 10Slyngshede) [12:40:09] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [12:40:23] (03CR) 10FNegri: [C: 04-1] "Thanks Arturo for reviewing." [puppet] - 10https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) (owner: 10FNegri) [12:41:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:42:18] (03PS1) 10Filippo Giunchedi: pki: restore aux default expiration [puppet] - 10https://gerrit.wikimedia.org/r/953250 (https://phabricator.wikimedia.org/T344253) [12:42:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:43:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P51898 and previous config saved to /var/cache/conftool/dbconfig/20230829-124317-ladsgroup.json [12:43:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1013.eqiad.wmnet [12:43:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1013.eqiad.wmnet [12:44:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet [12:44:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T343718)', diff saved to https://phabricator.wikimedia.org/P51899 and previous config saved to /var/cache/conftool/dbconfig/20230829-124438-ladsgroup.json [12:44:44] T343718: Drop old columns of externallinks - 
https://phabricator.wikimedia.org/T343718 [12:47:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T343718)', diff saved to https://phabricator.wikimedia.org/P51900 and previous config saved to /var/cache/conftool/dbconfig/20230829-124739-ladsgroup.json [12:47:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [12:47:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [12:47:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T343718)', diff saved to https://phabricator.wikimedia.org/P51901 and previous config saved to /var/cache/conftool/dbconfig/20230829-124750-ladsgroup.json [12:50:50] (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Bind the admin interface to a socket instead of tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/952902 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [12:51:57] (03CR) 10JMeybohm: [C: 03+1] pki: restore aux default expiration [puppet] - 10https://gerrit.wikimedia.org/r/953250 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [12:52:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1014.eqiad.wmnet [12:53:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/953250 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [12:53:45] (03CR) 10Filippo Giunchedi: [C: 03+2] pki: restore aux default expiration [puppet] - 10https://gerrit.wikimedia.org/r/953250 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [12:54:59] (03CR) 10FNegri: "More questions inline :)" [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [12:55:17] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% 
[12:55:19] PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:57:47] (03CR) 10JMeybohm: Add cookbook to configure router's BGP sessions to k8s hosts (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [12:58:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T343718)', diff saved to https://phabricator.wikimedia.org/P51902 and previous config saved to /var/cache/conftool/dbconfig/20230829-125823-ladsgroup.json [12:58:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [12:58:30] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:58:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [12:58:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T343718)', diff saved to https://phabricator.wikimedia.org/P51903 and previous config saved to /var/cache/conftool/dbconfig/20230829-125844-ladsgroup.json [12:59:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1014.eqiad.wmnet [12:59:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet [12:59:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P51904 and previous config saved to /var/cache/conftool/dbconfig/20230829-125944-ladsgroup.json 
[12:59:51] (03CR) 10JMeybohm: [C: 03+1] Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1300) [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:25] RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [13:00:29] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [13:00:39] nothing to deploy, yay [13:00:40] (03PS5) 10FNegri: [openstack] automatic file duplication for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) [13:00:42] (03PS5) 10FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) [13:00:44] (03PS1) 10FNegri: [openstack] remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/953252 [13:01:34] (03CR) 10JMeybohm: [C: 03+1] "Sounds right" [puppet] - 10https://gerrit.wikimedia.org/r/952198 (owner: 10Muehlenhoff) [13:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:01:41] (03PS1) 10Volans: sre.ganeti.*: fix auto-detection of Ganeti group [cookbooks] - 10https://gerrit.wikimedia.org/r/953253 (https://phabricator.wikimedia.org/T344813) [13:01:52] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-MoritzMuehlenhoff: cookbook sre.ganeti.makevm fails when no group is set - 
https://phabricator.wikimedia.org/T344813 (10Volans) a:03Volans [13:02:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1015.eqiad.wmnet [13:02:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [13:04:47] (03CR) 10Majavah: [C: 04-1] "-1 for the repo URL, and a question." [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [13:06:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T343718)', diff saved to https://phabricator.wikimedia.org/P51905 and previous config saved to /var/cache/conftool/dbconfig/20230829-130656-ladsgroup.json [13:07:02] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:14:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P51906 and previous config saved to /var/cache/conftool/dbconfig/20230829-131451-ladsgroup.json [13:14:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1015.eqiad.wmnet [13:17:29] PROBLEM - Host kubestagetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:20:45] RECOVERY - Host kubestagetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [13:21:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1015.eqiad.wmnet [13:21:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1015.eqiad.wmnet [13:21:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1016.eqiad.wmnet [13:22:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [13:22:03] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P51907 and previous config saved to /var/cache/conftool/dbconfig/20230829-132202-ladsgroup.json [13:22:55] (03PS1) 10Herron: Revert "trafficserver: Use svc urls for eqiad/codfw" [puppet] - 10https://gerrit.wikimedia.org/r/953207 [13:23:19] (03CR) 10CI reject: [V: 04-1] Revert "trafficserver: Use svc urls for eqiad/codfw" [puppet] - 10https://gerrit.wikimedia.org/r/953207 (owner: 10Herron) [13:23:25] (03PS2) 10Herron: Revert "trafficserver: Use svc urls for eqiad/codfw" [puppet] - 10https://gerrit.wikimedia.org/r/953207 (https://phabricator.wikimedia.org/T326657) [13:23:48] (03CR) 10CI reject: [V: 04-1] Revert "trafficserver: Use svc urls for eqiad/codfw" [puppet] - 10https://gerrit.wikimedia.org/r/953207 (https://phabricator.wikimedia.org/T326657) (owner: 10Herron) [13:24:08] (03PS3) 10Herron: Revert "trafficserver: Use svc urls for eqiad/codfw" [puppet] - 10https://gerrit.wikimedia.org/r/953207 (https://phabricator.wikimedia.org/T326657) [13:24:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [13:24:20] (03CR) 10JMeybohm: [C: 03+1] Stop building stretch images and update monitoring for the docker registry [puppet] - 10https://gerrit.wikimedia.org/r/953201 (owner: 10Muehlenhoff) [13:24:31] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel: puppetmaster1006 to puppetserver1002 - https://phabricator.wikimedia.org/T345080 (10VRiley-WMF) Relabeled puppetmaster1006 to puppetserver1002 as requested. 
[13:25:15] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [13:25:22] (03CR) 10Herron: [C: 03+2] Revert "trafficserver: Use svc urls for eqiad/codfw" [puppet] - 10https://gerrit.wikimedia.org/r/953207 (https://phabricator.wikimedia.org/T326657) (owner: 10Herron) [13:25:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2025'] [13:25:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/953253 (https://phabricator.wikimedia.org/T344813) (owner: 10Volans) [13:26:32] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [13:26:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2025'] [13:27:20] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel: puppetmaster1006 to puppetserver1002 - https://phabricator.wikimedia.org/T345080 (10VRiley-WMF) 05Open→03Resolved Physical relabeling completed. 
[13:27:23] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10VRiley-WMF) [13:28:11] !log installing openssl security updates on buster [13:28:11] (03CR) 10Fabian Kaelin: [C: 03+1] miscweb: add www.wikiworkshop.org to extraFQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/949845 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [13:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:50] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [13:29:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2025'] [13:29:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T343718)', diff saved to https://phabricator.wikimedia.org/P51908 and previous config saved to /var/cache/conftool/dbconfig/20230829-132957-ladsgroup.json [13:29:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [13:30:02] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:30:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [13:30:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1211 (T343718)', diff saved to https://phabricator.wikimedia.org/P51909 and previous config saved to /var/cache/conftool/dbconfig/20230829-133018-ladsgroup.json [13:37:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P51910 and previous config saved to /var/cache/conftool/dbconfig/20230829-133708-ladsgroup.json [13:37:12] 10SRE, 10ops-eqiad, 
10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10VRiley-WMF) Updated physical labeling as requested. [13:37:45] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:38:10] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe [13:38:11] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [13:38:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2025'] [13:39:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [13:40:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1016.eqiad.wmnet [13:41:41] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:48] (03PS1) 10Volans: sre.ganeti.makevm: fix call to Ganeti-Netbox sync [cookbooks] - 10https://gerrit.wikimedia.org/r/953257 (https://phabricator.wikimedia.org/T344812) [13:41:49] PROBLEM - Host kubetcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:42:22] (03PS3) 10Elukey: eventgate: set a more performant default for queue.buffering.max.ms [deployment-charts] - 10https://gerrit.wikimedia.org/r/937432 (https://phabricator.wikimedia.org/T338357) [13:43:07] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:19] (03PS1) 10Hnowlan: geo-analytics: use ingress 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/953258 (https://phabricator.wikimedia.org/T336400) [13:43:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [13:43:46] (03CR) 10Volans: "From Netbox:" [cookbooks] - 10https://gerrit.wikimedia.org/r/953257 (https://phabricator.wikimedia.org/T344812) (owner: 10Volans) [13:44:11] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [13:44:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2025'] [13:44:53] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [13:45:21] (03PS1) 10Ilias Sarantopoulos: ores-extension: replace first batch of wikis model thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953259 (https://phabricator.wikimedia.org/T343308) [13:45:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2025'] [13:45:25] RECOVERY - Host kubetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms [13:45:34] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Spicerack, and 2 others: cookbook sre.ganeti.makevm calls wrong netbox_ganeti_codfw_sync.service - https://phabricator.wikimedia.org/T344812 (10Volans) 05Open→03In progress a:03Volans [13:45:40] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-MoritzMuehlenhoff: cookbook sre.ganeti.makevm fails when no group is set - https://phabricator.wikimedia.org/T344813 (10Volans) 05Open→03In progress [13:45:48] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-MoritzMuehlenhoff: cookbook sre.ganeti.makevm fails when no group is set - https://phabricator.wikimedia.org/T344813 (10Volans) 
p:05Triage→03Medium [13:46:06] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Spicerack, and 2 others: cookbook sre.ganeti.makevm calls wrong netbox_ganeti_codfw_sync.service - https://phabricator.wikimedia.org/T344812 (10Volans) p:05Triage→03Medium [13:46:25] (03CR) 10Elukey: [C: 03+2] eventgate: set a more performant default for queue.buffering.max.ms [deployment-charts] - 10https://gerrit.wikimedia.org/r/937432 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [13:46:55] jouncebot: next [13:46:55] In 2 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1600) [13:47:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1016.eqiad.wmnet [13:47:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1016.eqiad.wmnet [13:48:46] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [13:48:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T343718)', diff saved to https://phabricator.wikimedia.org/P51911 and previous config saved to /var/cache/conftool/dbconfig/20230829-134854-ladsgroup.json [13:49:00] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:49:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['kubernetes2025'] [13:49:12] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [13:49:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2025'] [13:49:37] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:52] (03PS4) 10Hnowlan: service, conftool: add base configuration for geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947864 (https://phabricator.wikimedia.org/T336400) [13:50:21] (03PS5) 10Cparle: Enable temp accounts on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953232 (https://phabricator.wikimedia.org/T342067) [13:50:32] (03CR) 10Ladsgroup: "the beta cluster piece seems unrelated. Otherwise looks good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953259 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [13:51:01] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:02] (03CR) 10JMeybohm: [C: 03+1] geo-analytics: use ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/953258 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [13:51:28] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43051/console" [puppet] - 10https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [13:52:05] (03PS6) 10FNegri: [openstack] New files/templates for Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) [13:52:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T343718)', diff saved to https://phabricator.wikimedia.org/P51912 and previous config saved to /var/cache/conftool/dbconfig/20230829-135214-ladsgroup.json [13:52:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [13:52:29] (03PS2) 10FNegri: [openstack] remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/953252 [13:52:30] 
!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [13:52:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T343718)', diff saved to https://phabricator.wikimedia.org/P51913 and previous config saved to /var/cache/conftool/dbconfig/20230829-135236-ladsgroup.json [13:52:47] (03CR) 10JMeybohm: [C: 03+1] kubernetes: add users for media_analytics service, cassandra config [puppet] - 10https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [13:53:10] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] kubernetes: add users for media_analytics service, cassandra config [puppet] - 10https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [13:53:44] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync [13:53:57] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync [13:54:47] (03PS1) 10Effie Mouzeli: Update mathoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953261 (https://phabricator.wikimedia.org/T300033) [13:55:21] (03CR) 10Urbanecm: [C: 03+2] Enable temp accounts on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953232 (https://phabricator.wikimedia.org/T342067) (owner: 10Cparle) [13:55:29] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync [13:56:00] (03Merged) 10jenkins-bot: Enable temp accounts on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953232 (https://phabricator.wikimedia.org/T342067) (owner: 10Cparle) [13:56:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye [13:56:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye [13:56:11] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync [13:56:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye [13:56:25] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [13:59:11] (03PS7) 10Jbond: cumin: update cumin host to use the puppetdb-micro service [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) [13:59:19] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [14:00:13] (03CR) 10JHathaway: [C: 03+1] "woohoo!" 
[puppet] - 10https://gerrit.wikimedia.org/r/953237 (owner: 10Muehlenhoff) [14:00:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43052/console" [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [14:00:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/953253 (https://phabricator.wikimedia.org/T344813) (owner: 10Volans) [14:01:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/953257 (https://phabricator.wikimedia.org/T344812) (owner: 10Volans) [14:02:06] (03CR) 10Hnowlan: [C: 03+2] geo-analytics: use ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/953258 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [14:02:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/953237 (owner: 10Muehlenhoff) [14:03:04] (03PS1) 10Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) [14:03:18] (03Merged) 10jenkins-bot: geo-analytics: use ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/953258 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [14:04:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P51914 and previous config saved to /var/cache/conftool/dbconfig/20230829-140400-ladsgroup.json [14:05:01] !log elukey@deploy1002 helmfile [eqiad] START 
helmfile.d/services/eventgate-analytics: sync [14:05:05] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [14:05:09] (03CR) 10Muehlenhoff: [C: 03+2] Revert vendor note for concat [puppet] - 10https://gerrit.wikimedia.org/r/953237 (owner: 10Muehlenhoff) [14:05:10] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1017.eqiad.wmnet [14:05:23] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host restbase1017.eqiad.wmnet [14:05:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [14:05:29] (03CR) 10Volans: [C: 03+2] sre.ganeti.*: fix auto-detection of Ganeti group [cookbooks] - 10https://gerrit.wikimedia.org/r/953253 (https://phabricator.wikimedia.org/T344813) (owner: 10Volans) [14:05:44] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync [14:05:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1017.eqiad.wmnet [14:05:59] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [14:06:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:06:14] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1017.eqiad.wmnet [14:06:17] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host restbase1017.eqiad.wmnet [14:06:23] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [14:06:38] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1017.eqiad.wmnet [14:06:56] (03CR) 10Hnowlan: [C: 03+2] kubernetes: add users for 
media_analytics service, cassandra config [puppet] - 10https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [14:07:10] (03CR) 10Hnowlan: kubernetes: add users for media_analytics service, cassandra config [puppet] - 10https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [14:07:10] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync [14:07:19] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync [14:08:07] !log start rebooting ncredir hosts for T344587 [14:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [14:08:31] (03Merged) 10jenkins-bot: sre.ganeti.*: fix auto-detection of Ganeti group [cookbooks] - 10https://gerrit.wikimedia.org/r/953253 (https://phabricator.wikimedia.org/T344813) (owner: 10Volans) [14:08:40] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [14:08:56] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/953257 (https://phabricator.wikimedia.org/T344812) (owner: 10Volans) [14:10:11] (03CR) 10Volans: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [14:10:28] (03CR) 10Volans: [C: 03+2] sre.ganeti.makevm: fix call to Ganeti-Netbox sync [cookbooks] - 10https://gerrit.wikimedia.org/r/953257 
(https://phabricator.wikimedia.org/T344812) (owner: 10Volans) [14:10:40] (03PS1) 10JMeybohm: WIP: Publish AQS cassandra nodes for helm deployments [puppet] - 10https://gerrit.wikimedia.org/r/953267 [14:10:50] (03PS30) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [14:10:52] (03CR) 10David Caro: replica_cnf_api: add envvars backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [14:11:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T343718)', diff saved to https://phabricator.wikimedia.org/P51915 and previous config saved to /var/cache/conftool/dbconfig/20230829-141147-ladsgroup.json [14:11:54] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43053/console" [puppet] - 10https://gerrit.wikimedia.org/r/953267 (owner: 10JMeybohm) [14:12:03] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:12:26] (03PS2) 10JMeybohm: WIP: Publish AQS cassandra nodes for helm deployments [puppet] - 10https://gerrit.wikimedia.org/r/953267 [14:13:15] (03Merged) 10jenkins-bot: sre.ganeti.makevm: fix call to Ganeti-Netbox sync [cookbooks] - 10https://gerrit.wikimedia.org/r/953257 (https://phabricator.wikimedia.org/T344812) (owner: 10Volans) [14:14:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:35] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1017.eqiad.wmnet [14:18:07] (03PS1) 10Filippo Giunchedi: mesh: add 
KUBERNETES_NODE (spec.nodeName) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953268 (https://phabricator.wikimedia.org/T320563) [14:18:32] (03PS1) 10Fabfur: admin: removed xterm TERM setting as it mess with screen detection [puppet] - 10https://gerrit.wikimedia.org/r/953269 [14:18:56] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:01] (03PS16) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [14:19:03] (03PS16) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [14:19:05] (03PS16) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:19:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P51916 and previous config saved to /var/cache/conftool/dbconfig/20230829-141907-ladsgroup.json [14:19:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:22] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1018.eqiad.wmnet [14:19:56] (03PS2) 10Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 
10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) [14:20:12] (03CR) 10Fabfur: [C: 03+2] admin: removed xterm TERM setting as it mess with screen detection [puppet] - 10https://gerrit.wikimedia.org/r/953269 (owner: 10Fabfur) [14:20:30] (03CR) 10Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [14:21:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:21:43] (03CR) 10Eevans: [C: 03+1] kubernetes: add users for media_analytics service, cassandra config [puppet] - 10https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [14:21:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet [14:22:56] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir1001.eqiad.wmnet} and A:ncredir [14:24:04] 10SRE, 10Bitu, 10Infrastructure-Foundations: Live validation of usernames - https://phabricator.wikimedia.org/T345168 (10SLyngshede-WMF) [14:24:24] (03PS11) 10Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [14:24:37] (03PS1) 10Muehlenhoff: uwsgi: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953273 [14:24:40] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [14:25:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet [14:25:04] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr4-ulsfo [14:25:09] 10SRE, 10Bitu, 
10Infrastructure-Foundations: Live validation of usernames - https://phabricator.wikimedia.org/T345168 (10SLyngshede-WMF) [14:25:24] 10SRE, 10Bitu, 10Infrastructure-Foundations: Live validation of usernames - https://phabricator.wikimedia.org/T345168 (10SLyngshede-WMF) p:05Triage→03Low [14:25:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1017.eqiad.wmnet [14:26:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P51917 and previous config saved to /var/cache/conftool/dbconfig/20230829-142653-ladsgroup.json [14:27:00] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on P{ncredir1001.eqiad.wmnet} and A:ncredir [14:27:01] (03CR) 10JMeybohm: [C: 03+1] service, conftool: add base configuration for geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947864 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [14:27:11] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir4001.ulsfo.wmnet} and A:ncredir [14:28:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1018.eqiad.wmnet [14:28:22] (03CR) 10Hnowlan: [C: 03+2] service, conftool: add base configuration for geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947864 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [14:28:24] (03PS1) 10Muehlenhoff: openstack: Remove obsolete client classes [puppet] - 10https://gerrit.wikimedia.org/r/953274 [14:28:33] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir5001.*} and A:ncredir [14:28:38] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=97) rolling reboot on P{ncredir5001.*} and A:ncredir [14:28:44] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2140 (T343718)', diff saved to https://phabricator.wikimedia.org/P51918 and previous config saved to /var/cache/conftool/dbconfig/20230829-142843-ladsgroup.json [14:28:49] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:28:57] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir5001.eqsin.wmnet} and A:ncredir [14:29:22] (03PS17) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [14:29:24] (03PS17) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [14:29:26] (03PS17) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:29:35] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:29:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr4-ulsfo [14:30:11] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw2-22-ulsfo [14:30:48] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [14:31:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1018.eqiad.wmnet [14:31:19] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on P{ncredir4001.ulsfo.wmnet} and A:ncredir [14:32:56] !log ayounsi@cumin1001 END (PASS) - Cookbook 
sre.network.tls (exit_code=0) for network device asw2-22-ulsfo [14:32:58] (03CR) 10FNegri: [C: 03+1] "LGTM, but I'd wait for a +1 from Volans as well." [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [14:34:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T343718)', diff saved to https://phabricator.wikimedia.org/P51919 and previous config saved to /var/cache/conftool/dbconfig/20230829-143413-ladsgroup.json [14:34:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [14:34:18] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:34:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [14:34:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1214 (T343718)', diff saved to https://phabricator.wikimedia.org/P51920 and previous config saved to /var/cache/conftool/dbconfig/20230829-143434-ladsgroup.json [14:34:37] (03PS3) 10FNegri: [openstack] remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/953252 (https://phabricator.wikimedia.org/T341285) [14:34:47] (03CR) 10Hnowlan: [C: 03+2] cassandra-http-gateway: remove typo in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/951964 (owner: 10Kamila Součková) [14:35:14] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on P{ncredir5001.eqsin.wmnet} and A:ncredir [14:35:42] (03Merged) 10jenkins-bot: cassandra-http-gateway: remove typo in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/951964 (owner: 10Kamila Součková) [14:36:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye [14:36:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: 
Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye [14:37:07] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir6001.drmrs.wmnet} and A:ncredir [14:37:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I 'll merge based on the previous +1 and the addressing of the last round of comments." [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:37:40] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir2001.codfw.wmnet} and A:ncredir [14:38:10] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [14:38:10] (03Merged) 10jenkins-bot: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:38:15] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1021.eqiad.wmnet [14:38:19] (03PS8) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [14:38:21] (03PS1) 10Jbond: firewall: move contrac logic to firewall module [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) [14:38:44] (03CR) 10CI reject: [V: 04-1] ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [14:38:47] (03CR) 10CI reject: [V: 04-1] firewall: move contrac logic to firewall module [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 
10Jbond) [14:39:11] (03PS18) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [14:39:13] (03PS18) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:39:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [14:39:23] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: add wikifunctions entry [puppet] - 10https://gerrit.wikimedia.org/r/952000 (https://phabricator.wikimedia.org/T344147) (owner: 10Urbanecm) [14:40:03] (03PS2) 10Jbond: firewall: move contrac logic to firewall module [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) [14:40:05] (03PS9) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [14:40:43] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: cookbook sre.ganeti.makevm fails when no group is set - https://phabricator.wikimedia.org/T344813 (10Volans) 05In progress→03Resolved [14:40:52] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Spicerack, 10User-MoritzMuehlenhoff: cookbook sre.ganeti.makevm calls wrong netbox_ganeti_codfw_sync.service - https://phabricator.wikimedia.org/T344812 (10Volans) 05In progress→03Resolved [14:40:56] (03CR) 10Andrew Bogott: [C: 03+1] "fancy!" 
[puppet] - 10https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [14:41:26] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on P{ncredir6001.drmrs.wmnet} and A:ncredir [14:41:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:41:36] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on P{ncredir2001.codfw.wmnet} and A:ncredir [14:41:46] jouncebot: nowandnext [14:41:46] No deployments scheduled for the next 1 hour(s) and 18 minute(s) [14:41:47] In 1 hour(s) and 18 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1600) [14:41:48] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir1002.eqiad.wmnet} and A:ncredir [14:41:51] coool [14:42:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P51921 and previous config saved to /var/cache/conftool/dbconfig/20230829-144159-ladsgroup.json [14:42:18] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir2002.codfw.wmnet} and A:ncredir [14:42:39] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir4002.ulsfo.wmnet} and A:ncredir [14:43:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:43:28] (03PS2) 10Effie Mouzeli: Update cxserver to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947805 
(https://phabricator.wikimedia.org/T300033) [14:43:45] (03CR) 10Majavah: [C: 03+1] openstack: Remove obsolete client classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953274 (owner: 10Muehlenhoff) [14:43:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P51922 and previous config saved to /var/cache/conftool/dbconfig/20230829-144349-ladsgroup.json [14:43:56] (03Merged) 10jenkins-bot: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:44:30] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cloudsw1-b1-codfw [14:44:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-b1-codfw [14:44:33] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cloudsw1-c8-eqiad [14:45:09] (03PS3) 10Effie Mouzeli: Update cxserver to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947805 (https://phabricator.wikimedia.org/T300033) [14:45:11] (03PS6) 10FNegri: [openstack] automatic file duplication for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) [14:45:13] (03PS7) 10FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) [14:45:15] (03PS4) 10FNegri: [openstack] remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/953252 [14:45:18] (03CR) 10CI reject: [V: 04-1] Update cxserver to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947805 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [14:46:03] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on 
P{ncredir1002.eqiad.wmnet} and A:ncredir [14:46:08] (03Abandoned) 10Effie Mouzeli: Update cxserver to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947805 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [14:46:19] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:25] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on P{ncredir2002.codfw.wmnet} and A:ncredir [14:46:35] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [14:46:47] (03PS1) 10Ladsgroup: Enable url shortener in sidebar in RTL and some non-latin wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953279 (https://phabricator.wikimedia.org/T267921) [14:46:54] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir5002.eqsin.wmnet} and A:ncredir [14:46:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on P{ncredir4002.ulsfo.wmnet} and A:ncredir [14:47:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-c8-eqiad [14:47:05] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cloudsw1-d5-eqiad [14:47:07] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on P{ncredir6002.drmrs.wmnet} and A:ncredir [14:47:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [14:47:57] (03PS10) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [14:48:02] (03PS8) 
10FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) [14:48:09] (03PS5) 10FNegri: [openstack] remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/953252 (https://phabricator.wikimedia.org/T341285) [14:48:12] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1021.eqiad.wmnet [14:48:15] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1028.eqiad.wmnet [14:48:29] (03CR) 10CI reject: [V: 04-1] New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [14:48:58] (03CR) 10Jbond: "thanks for the feedback see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [14:49:09] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:22] (03PS1) 10Ladsgroup: Init patch for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953281 (https://phabricator.wikimedia.org/T345166) [14:49:24] (03PS1) 10Effie Mouzeli: Update cxserver to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953282 (https://phabricator.wikimedia.org/T300033) [14:49:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-d5-eqiad [14:49:36] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cloudsw1-e4-eqiad [14:49:44] (03PS1) 10JMeybohm: AQS2: Move common settings (AQS cassandra nodes) to _aqs2-common_ [deployment-charts] - 10https://gerrit.wikimedia.org/r/953283 (https://phabricator.wikimedia.org/T336400) [14:50:07] PROBLEM - PyBal IPVS diff check on lvs5006 is CRITICAL: (CRITICAL: Mismatch between IPVS and 
PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:50:09] PROBLEM - PyBal IPVS diff check on lvs5004 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:50:20] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1124.eqiad.wmnet with OS bullseye [14:50:23] (03CR) 10Effie Mouzeli: [C: 03+2] Update cxserver to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953282 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [14:50:36] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:50:36] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:50:54] all good fabfur? [14:51:06] do you need help? 
[14:51:07] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1125.eqiad.wmnet with OS bullseye [14:51:16] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on P{ncredir6002.drmrs.wmnet} and A:ncredir [14:51:17] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on P{ncredir5002.eqsin.wmnet} and A:ncredir [14:51:33] (03PS19) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:51:42] I acked the page [14:51:51] thx claime [14:51:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-e4-eqiad [14:51:54] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cloudsw1-f4-eqiad [14:52:05] ack [14:52:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T343718)', diff saved to https://phabricator.wikimedia.org/P51923 and previous config saved to /var/cache/conftool/dbconfig/20230829-145242-ladsgroup.json [14:52:48] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:54:00] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [14:54:02] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [14:54:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-f4-eqiad [14:54:15] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cloudsw2-d5-eqiad [14:55:12] (03CR) 10Elukey: [C: 03+1] uwsgi: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953273 (owner: 10Muehlenhoff) [14:55:27] (03CR) 10Jbond: [V: 03+1] cumin: update cumin host to use the puppetdb-micro service (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [14:55:31] RECOVERY - PyBal IPVS diff check on lvs5004 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:55:31] RECOVERY - PyBal IPVS diff check on lvs5006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:55:34] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [14:55:36] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:36] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [14:55:36] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:44] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync [14:55:46] (03PS2) 10Jbond: puppetdb-api: swap the production and next environments [puppet] - 10https://gerrit.wikimedia.org/r/940384 (https://phabricator.wikimedia.org/T342214) [14:56:18] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync [14:56:20] (03PS2) 10Ladsgroup: Init patch for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953281 (https://phabricator.wikimedia.org/T345166) [14:56:27] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements 
returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw2-d5-eqiad [14:56:49] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [14:56:52] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [14:56:57] oh restbase... [14:57:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1028.eqiad.wmnet [14:57:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T343718)', diff saved to https://phabricator.wikimedia.org/P51924 and previous config saved to /var/cache/conftool/dbconfig/20230829-145705-ladsgroup.json [14:57:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [14:57:08] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1031.eqiad.wmnet [14:57:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [14:57:27] urandom: It's been noisy for the past few days hasn't it [14:57:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T343718)', diff saved to https://phabricator.wikimedia.org/P51925 and previous config saved to /var/cache/conftool/dbconfig/20230829-145727-ladsgroup.json [14:58:27] well... for various reasons, yes [14:58:48] this, is because I'm rebooting hosts [14:58:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P51926 and previous config saved to /var/cache/conftool/dbconfig/20230829-145856-ladsgroup.json [14:59:21] which should be fine... 
I've no idea why it's being so sensitive to this [14:59:33] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw1-b12-drmrs [15:00:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet [15:00:23] (03CR) 10Gmodena: Increase the kafka-jumbo maximum message size to 10 MB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [15:01:31] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.147:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.147:7231/en.wikipedia.org/v1/media/m [15:01:31] k/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:01:58] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync [15:02:11] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:02:25] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:02:26] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync [15:02:56] (03PS7) 10AOkoth: vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) [15:03:29] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1124.eqiad.wmnet with reason: host reimage [15:03:31] (03CR) 10CI reject: [V: 04-1] vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 
(https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:03:36] (03CR) 10Ladsgroup: [C: 03+2] Enable url shortener in sidebar in RTL and some non-latin wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953279 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [15:03:38] (03PS1) 10Effie Mouzeli: Update cxserver to use certmanager certs (modules) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953285 (https://phabricator.wikimedia.org/T300033) [15:04:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b12-drmrs [15:04:15] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw1-b13-drmrs [15:04:17] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1125.eqiad.wmnet with reason: host reimage [15:04:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953279 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [15:04:21] (03Merged) 10jenkins-bot: Enable url shortener in sidebar in RTL and some non-latin wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953279 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [15:04:36] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:953279|Enable url shortener in sidebar in RTL and some non-latin wikis (T267921)]] [15:04:42] T267921: Roll out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [15:04:48] I'm going to try spacing them out (even) more. Hopefully I can get the August reboots done while it's still... you know, August. [15:05:09] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Ladsgroup) Hi @Mabualruz, you need to confirm your ssh key out of band with me. 
[15:05:33] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1031.eqiad.wmnet [15:05:35] just keep counting past 31, and you can finish up on August 36th or whatever :) [15:05:57] (03CR) 10JMeybohm: [C: 04-1] "I would suggest to rebase onto Ia1d320981309e6821cee0ab73a73607c0ecfeace to get rid of the re-definitions for cassandra." [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [15:06:04] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:953279|Enable url shortener in sidebar in RTL and some non-latin wikis (T267921)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [15:06:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet [15:06:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:34] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1124.eqiad.wmnet with reason: host reimage [15:06:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1018.eqiad.wmnet [15:06:44] (03CR) 10Muehlenhoff: "Looks good, typo/comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [15:06:52] bblack: ha! 
[15:07:09] (03CR) 10JMeybohm: [C: 03+1] Update mathoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953261 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [15:07:34] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P51927 and previous config saved to /var/cache/conftool/dbconfig/20230829-150749-ladsgroup.json [15:07:51] (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Add new minor version 1.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/952901 (owner: 10JMeybohm) [15:08:01] (03PS20) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [15:08:03] (03PS1) 10Alexandros Kosiaris: MariaDB egress: Brown paper bag fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/953286 (https://phabricator.wikimedia.org/T340843) [15:08:07] (03Abandoned) 10AOkoth: vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:08:21] (03Merged) 10jenkins-bot: mesh.configuration: Add new minor version 1.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/952901 (owner: 10JMeybohm) [15:08:25] (03CR) 10AOkoth: [C: 03+2] vrts: send /var/log/{clamav,freshclam}.log to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/945781 (owner: 10AOkoth) [15:08:38] (03Merged) 10jenkins-bot: mesh.configuration: Bind the admin interface to a socket instead of tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/952902 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [15:08:43] !log 
stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1125.eqiad.wmnet with reason: host reimage [15:08:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b13-drmrs [15:08:58] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw1-bw27-esams [15:09:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] MariaDB egress: Brown paper bag fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/953286 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [15:09:33] (03CR) 10JMeybohm: [C: 03+1] "You could now even update to mesh.configuration 1.4.0 from" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953261 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [15:09:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1019.eqiad.wmnet [15:10:00] (03Merged) 10jenkins-bot: MariaDB egress: Brown paper bag fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/953286 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [15:10:27] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [15:11:36] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:04] (03PS2) 10Effie Mouzeli: Update cxserver to use certmanager certs (modules) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953285 (https://phabricator.wikimedia.org/T300033) [15:13:14] (03PS3) 10Effie Mouzeli: Update cxserver to use certmanager certs (modules) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953285 (https://phabricator.wikimedia.org/T300033) [15:13:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-bw27-esams [15:13:43] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device 
asw1-by27-esams [15:14:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T343718)', diff saved to https://phabricator.wikimedia.org/P51928 and previous config saved to /var/cache/conftool/dbconfig/20230829-151402-ladsgroup.json [15:14:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [15:14:08] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:14:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [15:14:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T343718)', diff saved to https://phabricator.wikimedia.org/P51929 and previous config saved to /var/cache/conftool/dbconfig/20230829-151423-ladsgroup.json [15:14:50] (03PS5) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [15:15:27] (03PS2) 10Muehlenhoff: openstack: Remove obsolete client classes [puppet] - 10https://gerrit.wikimedia.org/r/953274 [15:15:31] (03CR) 10Muehlenhoff: openstack: Remove obsolete client classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953274 (owner: 10Muehlenhoff) [15:15:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:57] (03CR) 10Majavah: [C: 03+1] openstack: Remove obsolete client classes [puppet] - 10https://gerrit.wikimedia.org/r/953274 (owner: 10Muehlenhoff) [15:16:05] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: sync [15:16:14] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: sync [15:16:22] !log ladsgroup@deploy1002 Finished scap: 
Backport for [[gerrit:953279|Enable url shortener in sidebar in RTL and some non-latin wikis (T267921)]] (duration: 11m 46s) [15:16:25] (03PS3) 10Ladsgroup: Init patch for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953281 (https://phabricator.wikimedia.org/T345166) [15:16:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T343718)', diff saved to https://phabricator.wikimedia.org/P51930 and previous config saved to /var/cache/conftool/dbconfig/20230829-151625-ladsgroup.json [15:16:28] T267921: Roll out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [15:16:43] (03PS1) 10AOkoth: vrts: add test VM [puppet] - 10https://gerrit.wikimedia.org/r/953289 (https://phabricator.wikimedia.org/T340027) [15:16:44] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:31] (03CR) 10Ladsgroup: [C: 03+2] Init patch for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953281 (https://phabricator.wikimedia.org/T345166) (owner: 10Ladsgroup) [15:18:00] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-by27-esams [15:18:28] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw2-22-ulsfo [15:18:33] (03CR) 10Hnowlan: [C: 03+1] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/953283 (https://phabricator.wikimedia.org/T336400) (owner: 10JMeybohm) [15:18:39] (03Merged) 10jenkins-bot: Init patch for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953281 (https://phabricator.wikimedia.org/T345166) (owner: 10Ladsgroup) [15:18:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T343718)', diff saved to https://phabricator.wikimedia.org/P51931 and previous config saved to /var/cache/conftool/dbconfig/20230829-151857-ladsgroup.json [15:18:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1019.eqiad.wmnet [15:19:10] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:14] (03PS4) 10Effie Mouzeli: Update cxserver to use certmanager certs (modules) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953285 (https://phabricator.wikimedia.org/T300033) [15:19:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1019.eqiad.wmnet [15:19:40] (03CR) 10AOkoth: [C: 03+2] vrts: add test VM [puppet] - 10https://gerrit.wikimedia.org/r/953289 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:19:50] (03PS2) 10AOkoth: vrts: add test VM [puppet] - 10https://gerrit.wikimedia.org/r/953289 (https://phabricator.wikimedia.org/T340027) [15:19:57] (03CR) 10AOkoth: [V: 03+2] vrts: add test VM [puppet] - 10https://gerrit.wikimedia.org/r/953289 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:19:58] !log ladsgroup@deploy1002 Started scap: Creating tlywiki (T345166) [15:20:04] T345166: Create Wikipedia Talysh - https://phabricator.wikimedia.org/T345166 [15:20:15] (03PS3) 10Hnowlan: kubernetes: add users for media_analytics service, cassandra config [puppet] - 10https://gerrit.wikimedia.org/r/951547 
(https://phabricator.wikimedia.org/T336380) [15:20:50] (03CR) 10Effie Mouzeli: [C: 03+2] Update cxserver to use certmanager certs (modules) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953285 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [15:21:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw2-22-ulsfo [15:21:08] (03CR) 10Elukey: LiftWing: add latency/availability SLO dashboards (036 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [15:21:09] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw2-a7-eqiad [15:21:19] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:21:49] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:21:54] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:21:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet [15:21:58] !log aokoth@cumin1001 START - Cookbook sre.ganeti.makevm for new host vrts1002.eqiad.wmnet [15:21:59] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox [15:22:08] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:22:37] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:22:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P51932 and previous config saved to /var/cache/conftool/dbconfig/20230829-152255-ladsgroup.json [15:23:32] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: sync [15:23:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:23:49] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync [15:24:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw2-a7-eqiad [15:24:09] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw2-b2-eqiad [15:24:46] !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts1002.eqiad.wmnet - aokoth@cumin1001" [15:24:49] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:24:51] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:24:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Papaul) a:03Papaul [15:25:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:25:33] !log aokoth@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts1002.eqiad.wmnet - aokoth@cumin1001" [15:25:33] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:33] !log aokoth@cumin1001 START - Cookbook 
sre.dns.wipe-cache vrts1002.eqiad.wmnet on all recursors [15:25:36] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) vrts1002.eqiad.wmnet on all recursors [15:25:39] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:25:44] (03PS1) 10JMeybohm: wikifunctions: Don't expose envoy admin and k8s service accounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/953291 (https://phabricator.wikimedia.org/T343709) [15:25:56] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox [15:26:48] (03CR) 10JMeybohm: "Feel free to amend and/or deploy when you feel ready" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952782 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [15:27:02] !log ladsgroup@deploy1002 Finished scap: Creating tlywiki (T345166) (duration: 07m 03s) [15:27:08] T345166: Create Wikipedia Talysh - https://phabricator.wikimedia.org/T345166 [15:27:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw2-b2-eqiad [15:27:19] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw2-c2-eqiad [15:27:26] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [15:27:31] (03CR) 10JMeybohm: "Feel free to deploy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953291 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [15:27:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS bullseye [15:27:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1019.eqiad.wmnet [15:27:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1019.eqiad.wmnet [15:27:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye [15:27:56] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1124.eqiad.wmnet with OS bullseye [15:28:20] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [15:28:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:28:53] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device asw2-c2-eqiad [15:28:56] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:07] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:29:51] !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM vrts1002.eqiad.wmnet - aokoth@cumin1001" [15:29:59] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: sync [15:30:18] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync [15:30:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Papaul) @Jclark-ctr @VRiley-WMF i am taking over this task to finish it for the cloud team. 
thanks [15:30:33] (03PS1) 10Jbond: tox: add python 3.11 [software/cumin] - 10https://gerrit.wikimedia.org/r/953294 [15:30:35] !log aokoth@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM vrts1002.eqiad.wmnet - aokoth@cumin1001" [15:30:35] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:30:36] !log aokoth@cumin1001 START - Cookbook sre.dns.wipe-cache vrts1002.eqiad.wmnet on all recursors [15:30:37] (03PS1) 10Jbond: puppetdb: optimize query [software/cumin] - 10https://gerrit.wikimedia.org/r/953295 [15:30:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:40] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) vrts1002.eqiad.wmnet on all recursors [15:30:40] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host vrts1002.eqiad.wmnet [15:30:52] (03PS1) 10Ladsgroup: Add tly to langlist [dns] - 10https://gerrit.wikimedia.org/r/953296 (https://phabricator.wikimedia.org/T345166) [15:31:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P51933 and previous config saved to /var/cache/conftool/dbconfig/20230829-153132-ladsgroup.json [15:31:48] (03CR) 10CI reject: [V: 04-1] Add tly to langlist [dns] - 10https://gerrit.wikimedia.org/r/953296 (https://phabricator.wikimedia.org/T345166) (owner: 10Ladsgroup) [15:31:52] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1125.eqiad.wmnet with OS bullseye [15:31:59] (03PS2) 10Ladsgroup: Add tly to langlist [dns] - 10https://gerrit.wikimedia.org/r/953296 (https://phabricator.wikimedia.org/T345166) [15:32:06] (03CR) 10Arturo Borrero Gonzalez: "LGTM, minor comments inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [15:32:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudservices1006'] [15:33:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack: Remove obsolete client classes [puppet] - 10https://gerrit.wikimedia.org/r/953274 (owner: 10Muehlenhoff) [15:33:20] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:24] (03CR) 10Ladsgroup: [C: 03+2] Add tly to langlist [dns] - 10https://gerrit.wikimedia.org/r/953296 (https://phabricator.wikimedia.org/T345166) (owner: 10Ladsgroup) [15:33:52] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:34:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P51934 and previous config saved to /var/cache/conftool/dbconfig/20230829-153403-ladsgroup.json [15:34:06] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync [15:34:16] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [15:34:56] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [15:35:42] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [15:36:06] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:45] (03CR) 10CI reject: [V: 04-1] tox: add python 3.11 [software/cumin] - 10https://gerrit.wikimedia.org/r/953294 (owner: 10Jbond) [15:37:03] (03CR) 10CI 
reject: [V: 04-1] puppetdb: optimize query [software/cumin] - 10https://gerrit.wikimedia.org/r/953295 (owner: 10Jbond) [15:37:44] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [15:37:47] Amir1: are you only planning the 1 wiki creation? If so I do wikistats in a minute [15:37:52] After tv episode finishes [15:38:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudservices1006'] [15:38:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T343718)', diff saved to https://phabricator.wikimedia.org/P51935 and previous config saved to /var/cache/conftool/dbconfig/20230829-153801-ladsgroup.json [15:38:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [15:38:11] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:38:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [15:38:20] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [15:38:52] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:39:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:29] (03CR) 10Muehlenhoff: firewall: move contrac logic to firewall module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [15:41:46] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units 
failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:10] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:52] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:45:41] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [15:46:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:17] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [15:46:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P51936 and previous config saved to /var/cache/conftool/dbconfig/20230829-154638-ladsgroup.json [15:47:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero) Please use `partman/raid10-4dev.cfg` as it seems the most standard for 4 devices. 
[15:47:40] (03CR) 10Herron: [C: 03+1] LiftWing: add latency/availability SLO dashboards (037 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [15:47:58] (03PS1) 10Papaul: Add cloudservices1006 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/953299 (https://phabricator.wikimedia.org/T342161) [15:49:09] (03CR) 10Papaul: [C: 03+2] Add cloudservices1006 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/953299 (https://phabricator.wikimedia.org/T342161) (owner: 10Papaul) [15:49:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P51937 and previous config saved to /var/cache/conftool/dbconfig/20230829-154909-ladsgroup.json [15:50:10] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2001.codfw.wmnet [15:51:19] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe [15:52:09] (03CR) 10Herron: [C: 03+1] LiftWing: add latency/availability SLO dashboards (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [15:53:27] jouncebot: nowandnext [15:53:28] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [15:53:28] In 0 hour(s) and 6 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1600) [15:54:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [15:54:47] (03CR) 10FNegri: [C: 03+2] [openstack] automatic file duplication for upgrade 
[puppet] - 10https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [15:54:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [15:55:31] (03PS9) 10FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) [15:55:39] (03PS6) 10FNegri: [openstack] remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/953252 (https://phabricator.wikimedia.org/T341285) [15:56:26] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye [15:56:34] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [15:56:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2001.codfw.wmnet [15:58:38] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2002.codfw.wmnet [15:59:06] (03CR) 10Andrew Bogott: [C: 03+1] [openstack] remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/953252 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [15:59:39] (03CR) 10Andrew Bogott: [C: 03+1] New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [16:00:04] jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1600). 
[16:00:04] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Ladsgroup) a:05Mabualruz→03Ladsgroup SSH key confirmed out of band. [16:00:29] (03PS21) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [16:01:42] (03CR) 10Hnowlan: [C: 03+2] kubernetes: add users for media_analytics service, cassandra config [puppet] - 10https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [16:01:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T343718)', diff saved to https://phabricator.wikimedia.org/P51938 and previous config saved to /var/cache/conftool/dbconfig/20230829-160144-ladsgroup.json [16:01:50] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [16:02:29] (03PS2) 10Ladsgroup: ores-extension: replace first batch of wikis model thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953259 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [16:02:34] (03CR) 10Ladsgroup: [C: 03+2] ores-extension: replace first batch of wikis model thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953259 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [16:02:58] jouncebot nowandnext [16:02:58] For the next 0 hour(s) and 57 minute(s): Puppet 
request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1600) [16:02:58] In 0 hour(s) and 57 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1700) [16:03:15] (03Merged) 10jenkins-bot: ores-extension: replace first batch of wikis model thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953259 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [16:03:45] (03CR) 10Ilias Sarantopoulos: ores-extension: replace first batch of wikis model thresholds with numeric values (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953259 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [16:04:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T343718)', diff saved to https://phabricator.wikimedia.org/P51939 and previous config saved to /var/cache/conftool/dbconfig/20230829-160415-ladsgroup.json [16:04:16] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:953259|ores-extension: replace first batch of wikis model thresholds with numeric values (T343308)]] [16:04:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:04:26] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [16:04:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:05:20] dancy: You deploying? OK for me to push out some service fixes? [16:05:36] Not deploying. All yours. [16:05:39] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2002.codfw.wmnet [16:05:42] Cool. 
[16:05:44] !log ladsgroup@deploy1002 ladsgroup and isaranto: Backport for [[gerrit:953259|ores-extension: replace first batch of wikis model thresholds with numeric values (T343308)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [16:05:58] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Don't expose envoy admin and k8s service accounts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953291 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [16:06:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [16:06:18] James_F: Amir1 is scaping [16:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:06:31] Yeah, MW-side won't affect anything. [16:06:55] (03Merged) 10jenkins-bot: wikifunctions: Don't expose envoy admin and k8s service accounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/953291 (https://phabricator.wikimedia.org/T343709) (owner: 10JMeybohm) [16:07:13] (03Merged) 10jenkins-bot: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [16:07:30] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:07:37] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:07:44] Bah, waiting for chartmuseum. 
[16:08:21] !log ladsgroup@deploy1002 ladsgroup and isaranto: Continuing with sync [16:08:41] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:08:43] (03CR) 10Hnowlan: [C: 03+2] AQS2: Move common settings (AQS cassandra nodes) to _aqs2-common_ [deployment-charts] - 10https://gerrit.wikimedia.org/r/953283 (https://phabricator.wikimedia.org/T336400) (owner: 10JMeybohm) [16:09:10] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [16:09:11] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:09:19] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [16:09:29] (03Merged) 10jenkins-bot: AQS2: Move common settings (AQS cassandra nodes) to _aqs2-common_ [deployment-charts] - 10https://gerrit.wikimedia.org/r/953283 (https://phabricator.wikimedia.org/T336400) (owner: 10JMeybohm) [16:09:33] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1022.eqiad.wmnet [16:09:43] !log deploy cxserver mariadb egress functionality. T341117 [16:09:43] o/ (sorry I'm a bit late) [16:09:46] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:48] T341117: cxserver: Section Mapping Database (m5) not accessible by certain region - https://phabricator.wikimedia.org/T341117 [16:10:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:35] is someone able to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/952045/ ? 
[16:10:50] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:10:53] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [16:11:05] it's a fairly trivial change to multi-dc URL rules [16:11:06] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:11:22] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [16:11:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:12:03] tgr: I can sort it out [16:12:07] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:12:24] thanks bblack! [16:13:25] (03PS2) 10Jforrester: Fix wikifunctions orchestrator not using the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/952782 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [16:13:35] (03CR) 10Jforrester: [C: 03+2] "I'll deploy as-is for now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/952782 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [16:13:48] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:953259|ores-extension: replace first batch of wikis model thresholds with numeric values (T343308)]] (duration: 09m 31s) [16:13:53] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [16:14:13] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [16:14:23] (03Merged) 10jenkins-bot: Fix wikifunctions orchestrator not using the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/952782 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [16:14:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [16:15:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:45] (03CR) 10BBlack: [C: 03+2] multi-dc: Fix central autologin URL pattern [puppet] - 10https://gerrit.wikimedia.org/r/952045 (owner: 10Gergő Tisza) [16:16:09] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [16:16:31] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [16:16:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:17:12] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:17:34] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [16:17:47] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply 
[16:17:49] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:18:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1022.eqiad.wmnet [16:18:24] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:33] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:18:47] tgr: it will take ~30m to naturaly roll out to affect all nodes [16:18:53] *naturally :) [16:19:25] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:19:27] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:20:20] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:35] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:21:12] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:21:36] (03PS1) 10Jforrester: Revert "Fix wikifunctions orchestrator not using the service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953211 [16:21:40] (03CR) 10Jforrester: [C: 03+2] Revert "Fix wikifunctions orchestrator not using the service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953211 (owner: 10Jforrester) [16:22:27] (03Merged) 10jenkins-bot: Revert "Fix wikifunctions orchestrator not using the service mesh" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/953211 (owner: 10Jforrester) [16:23:10] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:11] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:23:29] (03PS1) 10Alexandros Kosiaris: mariadb egress: Second round of brown paper bag fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/953308 (https://phabricator.wikimedia.org/T340843) [16:23:59] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:24:20] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:24:27] (03CR) 10Jelto: [C: 03+1] "lgtm. We can think about leaving the thanos config option available but set thanos_storage_enabled to false on all hosts. But just lookup " [puppet] - 10https://gerrit.wikimedia.org/r/953193 (owner: 10EoghanGaffney) [16:25:23] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:25:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:38] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:25:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] mariadb egress: Second round of brown paper bag fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/953308 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [16:26:07] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:26:37] (03Merged) 10jenkins-bot: mariadb egress: Second round of brown paper bag fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/953308 (https://phabricator.wikimedia.org/T340843) (owner: 
10Alexandros Kosiaris) [16:27:06] (03PS6) 10Hnowlan: helmfile: add entries and namespace for media-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) [16:27:22] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:46] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:21] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [16:30:25] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [16:30:48] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [16:30:53] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [16:30:58] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [16:31:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:28] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [16:31:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:31:53] that redis alert firing is ORES again [16:32:00] (03PS7) 10Hnowlan: helmfile: add entries and namespace for media-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) [16:32:03] why is it firing so consistently lately? 
[16:32:38] (03CR) 10Hnowlan: helmfile: add entries and namespace for media-analytics service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [16:32:39] akosiaris: is it? I thought it was rdb1011 only, I downtimed those alerts [16:33:44] ah snap it is a replica [16:33:48] ufff [16:34:36] * elukey downtimes the alert [16:34:40] I've edited your downtime to include the other hosts [16:34:45] and extended it to 50 days [16:34:47] thanks! [16:35:03] if we don't deprecate ores in 50 days I'll cry [16:35:42] (03PS1) 10Jforrester: Re-apply "Fix wikifunctions orchestrator not using the service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) [16:39:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:57] (03PS1) 10Hnowlan: wmnet: add geo-analytics and media-analytics ingress records [dns] - 10https://gerrit.wikimedia.org/r/953311 (https://phabricator.wikimedia.org/T336400) [16:55:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:24] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-c734.scope,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1700) [17:00:11] (03CR) 10FNegri: [C: 03+2] New
files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [17:00:13] (03CR) 10FNegri: [C: 03+2] [openstack] remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/953252 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [17:00:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:09] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953313 (https://phabricator.wikimedia.org/T343726) [17:04:11] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953313 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot) [17:04:53] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953313 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot) [17:05:21] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.24 refs T343726 [17:05:28] T343726: 1.41.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T343726 [17:09:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:14:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:20] (03PS1) 10Urbanecm: Growth: Disable Add an image on all 
wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953314 (https://phabricator.wikimedia.org/T345188) [17:27:39] jeena: i see you're deploying train to testwikis. the above patch is fairly urgent (but not urgent enough to stop the promotion :)) -- can you please ping me once you're finished? [17:28:09] (03CR) 10Jforrester: [C: 03+1] "Approved. You might want a less generic text than "A new editing interface that allows you to edit pages faster" for the description to ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952928 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta) [17:31:03] urbanecm: This is just the deployment to testwikis that was delayed. I won't deploy to group0 until the window at 18:00 UTC. So we can include your patch I think [17:31:14] I'll ping you when this is done [17:31:17] ty! [17:33:05] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1023.eqiad.wmnet [17:34:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:35:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:35:12] (03PS1) 10Dreamy Jazz: clienthints: Raise maxlag for API back to default for group0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953315 (https://phabricator.wikimedia.org/T344797) [17:35:46] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:51] bblack: tested,
works. Thanks again! [17:37:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T343718)', diff saved to https://phabricator.wikimedia.org/P51941 and previous config saved to /var/cache/conftool/dbconfig/20230829-173707-ladsgroup.json [17:37:10] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:13] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [17:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [17:38:56] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:39:04] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:40:34] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:41:04] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:41:09] !log 
eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1023.eqiad.wmnet [17:42:36] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:47:17] (03PS1) 10Volans: sre.ganeti.makevm: fix bug getting cluster name [cookbooks] - 10https://gerrit.wikimedia.org/r/953318 (https://phabricator.wikimedia.org/T344813) [17:48:49] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.24 refs T343726 (duration: 43m 27s) [17:48:55] T343726: 1.41.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T343726 [17:48:56] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:49:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10decommission-hardware: hw troubleshooting: ipmi down for wdqs1005.eqiad.wmnet - https://phabricator.wikimedia.org/T345081 (10Jclark-ctr) 05Open→03Resolved a:05Papaul→03Jclark-ctr performed flea power drain idrac connection came back [17:49:32] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1024.eqiad.wmnet [17:51:02] !log jhuneidi@deploy1002 Pruned MediaWiki: 1.41.0-wmf.22 (duration: 02m 11s) [17:51:53] urbanecm: done [17:51:57] ty [17:52:10] (03CR) 10Urbanecm: [C: 03+2] Growth: Disable Add an image on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953314 (https://phabricator.wikimedia.org/T345188) (owner: 10Urbanecm) [17:52:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P51942 and previous config saved to /var/cache/conftool/dbconfig/20230829-175213-ladsgroup.json [17:52:24] 
(03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953314 (https://phabricator.wikimedia.org/T345188) (owner: 10Urbanecm) [17:52:50] (03Merged) 10jenkins-bot: Growth: Disable Add an image on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953314 (https://phabricator.wikimedia.org/T345188) (owner: 10Urbanecm) [17:52:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:27] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:953314|Growth: Disable Add an image on all wikis (T345188)]] [17:53:32] T345188: Add Image: all wikis ran out of image recommendations - https://phabricator.wikimedia.org/T345188 [17:53:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:53:51] (03CR) 10Volans: "the previous fix had a bug actually... 
sorry about that" [cookbooks] - 10https://gerrit.wikimedia.org/r/953318 (https://phabricator.wikimedia.org/T344813) (owner: 10Volans) [17:54:19] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:55:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices1006.eqiad.wmnet with OS bullseye [17:56:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye exe... [17:58:38] (03CR) 10Volans: [C: 03+2] "Self-merging to unblock Arnold and testing it at the same time." [cookbooks] - 10https://gerrit.wikimedia.org/r/953318 (https://phabricator.wikimedia.org/T344813) (owner: 10Volans) [17:58:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:59:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1024.eqiad.wmnet [18:00:06] jeena and dduvall: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1800). 
[18:00:08] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:00:14] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:953314|Growth: Disable Add an image on all wikis (T345188)]] (duration: 06m 47s) [18:00:29] T345188: Add Image: all wikis ran out of image recommendations - https://phabricator.wikimedia.org/T345188 [18:01:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:23] (03Merged) 10jenkins-bot: sre.ganeti.makevm: fix bug getting cluster name [cookbooks] - 10https://gerrit.wikimedia.org/r/953318 (https://phabricator.wikimedia.org/T344813) (owner: 10Volans) [18:04:18] !log aokoth@cumin1001 START - Cookbook sre.ganeti.makevm for new host vrts1002.eqiad.wmnet [18:04:19] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox [18:05:08] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:05:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:06:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:06:10] * urbanecm done [18:06:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T343718)', diff saved to https://phabricator.wikimedia.org/P51943 and previous config saved to /var/cache/conftool/dbconfig/20230829-180613-ladsgroup.json [18:06:24] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [18:06:40] thanks urbanecm [18:07:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P51944 and previous config saved to /var/cache/conftool/dbconfig/20230829-180719-ladsgroup.json [18:07:20] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953321 (https://phabricator.wikimedia.org/T343726) [18:07:22] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953321 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot) [18:07:30] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1029.eqiad.wmnet [18:08:00] !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts1002.eqiad.wmnet - aokoth@cumin1001" [18:08:09] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953321 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot) [18:08:47] !log aokoth@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts1002.eqiad.wmnet - aokoth@cumin1001" [18:08:47] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:08:47] !log aokoth@cumin1001 START - Cookbook sre.dns.wipe-cache vrts1002.eqiad.wmnet on all recursors [18:08:51] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) vrts1002.eqiad.wmnet on all recursors [18:09:16] !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM vrts1002.eqiad.wmnet - aokoth@cumin1001" [18:10:02] !log aokoth@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created 
new VM vrts1002.eqiad.wmnet - aokoth@cumin1001" [18:11:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) [18:11:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) stat1011 C 3 , U40 Port 40 Cableid 3750 [18:12:10] !log aokoth@cumin1001 START - Cookbook sre.hosts.reimage for host vrts1002.eqiad.wmnet with OS bullseye [18:14:45] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.24 refs T343726 [18:14:52] T343726: 1.41.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T343726 [18:16:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1029.eqiad.wmnet [18:19:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jclark-ctr) pki1002 C 6 , U 13 Port 8 Cableid 3190 [18:20:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jclark-ctr) [18:21:32] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts1002.eqiad.wmnet with reason: host reimage [18:22:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T343718)', diff saved to https://phabricator.wikimedia.org/P51945 and previous config saved to /var/cache/conftool/dbconfig/20230829-182225-ladsgroup.json [18:22:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [18:22:31] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [18:22:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 
0:00:00 on db2155.codfw.wmnet with reason: Maintenance [18:22:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:22:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:22:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T343718)', diff saved to https://phabricator.wikimedia.org/P51946 and previous config saved to /var/cache/conftool/dbconfig/20230829-182251-ladsgroup.json [18:23:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:24:38] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on vrts1002.eqiad.wmnet with reason: host reimage [18:24:41] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1032.eqiad.wmnet [18:28:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:32:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1032.eqiad.wmnet [18:37:02] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host vrts1002.eqiad.wmnet with OS bullseye [18:37:02] !log aokoth@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host vrts1002.eqiad.wmnet [18:37:03] (03PS1) 10Ryan Kemper: s/fdqn/fqdn [puppet] - 10https://gerrit.wikimedia.org/r/953325 [18:37:29] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "1 byte 
comment change, just gonna merge" [puppet] - 10https://gerrit.wikimedia.org/r/953325 (owner: 10Ryan Kemper) [18:40:08] RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [18:40:52] 10SRE, 10ops-eqiad: Broken disk on thanos-be1003 - https://phabricator.wikimedia.org/T345079 (10Jclark-ctr) a:03Jclark-ctr Replaced hdd with spare hdd [18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:45:20] 10SRE, 10ops-eqiad: Broken disk on thanos-be1003 - https://phabricator.wikimedia.org/T345079 (10Jclark-ctr) 2023-08-29 18:36:50 PDR3 Disk 11 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly. 
Part Number = TH0XPJ47SGT0003K01PDA00 [18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:49:16] (03PS1) 10Papaul: Add cloudservices1006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/953327 (https://phabricator.wikimedia.org/T342161) [18:49:41] 10SRE, 10ops-eqiad: Broken disk on thanos-be1003 - https://phabricator.wikimedia.org/T345079 (10Jclark-ctr) imported foreign configuration to raid controller [18:49:43] (03CR) 10CI reject: [V: 04-1] Add cloudservices1006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/953327 (https://phabricator.wikimedia.org/T342161) (owner: 10Papaul) [18:51:02] 10SRE, 10ops-eqiad: Broken disk on thanos-be1003 - https://phabricator.wikimedia.org/T345079 (10Jclark-ctr) 05Open→03Resolved [18:52:42] (03PS2) 10Papaul: Add cloudservices1006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/953327 (https://phabricator.wikimedia.org/T342161) [18:52:55] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1025.eqiad.wmnet [18:53:29] (03CR) 10Papaul: [C: 03+2] Add cloudservices1006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/953327 (https://phabricator.wikimedia.org/T342161) (owner: 10Papaul) [18:53:48] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:54:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS bullseye [18:55:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye [18:55:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices1006.eqiad.wmnet with OS bullseye [18:55:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye exe... [18:55:32] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10Jclark-ctr) [18:55:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS bullseye [18:55:41] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10Jclark-ctr) 05Open→03Resolved [18:55:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye [18:55:52] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10Jclark-ctr) [18:56:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage [18:56:54] (03PS1) 10Zabe: Allow setting configurations through rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953328 [18:57:34] (03CR) 10CI reject: [V: 04-1] Allow setting configurations through rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953328 (owner: 10Zabe) [18:58:34] jouncebot: nowandnext 
[18:58:34] For the next 1 hour(s) and 1 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T1800) [18:58:34] In 1 hour(s) and 1 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T2000) [18:58:48] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:59:41] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952984 [18:59:43] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952984 (owner: 10Zabe) [19:00:31] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952984 (owner: 10Zabe) [19:00:46] !log zabe@deploy1002 Started scap: update interwiki cache [19:01:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1025.eqiad.wmnet [19:01:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage [19:06:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T343718)', diff saved to https://phabricator.wikimedia.org/P51947 and previous config saved to /var/cache/conftool/dbconfig/20230829-190635-ladsgroup.json [19:06:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:50] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:55] T343718: Drop old columns of externallinks - 
https://phabricator.wikimedia.org/T343718 [19:07:54] !log zabe@deploy1002 Finished scap: update interwiki cache (duration: 07m 08s) [19:08:36] (03PS2) 10Zabe: Allow setting configurations through rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953328 [19:08:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:08:50] * zabe done [19:10:08] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1026.eqiad.wmnet [19:10:56] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 173 [19:11:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 173 [19:13:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:13:53] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e1-eqiad [19:14:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jclark-ctr) [19:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e1-eqiad [19:16:12] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e2-eqiad [19:16:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: 
Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jclark-ctr) a:03Jclark-ctr [19:18:17] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:18:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1026.eqiad.wmnet [19:18:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e2-eqiad [19:18:31] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e3-eqiad [19:19:46] Hello. We have a Gerrit patch that is refusing to merge. We have +2'd it twice, commented `recheck` twice, and then removed the reviewers and re-added myself back as a reviewer and +2'd it again, and it still has not merged. The CI has returned green each time as well. Anyone have any ideas? https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/947474 [19:20:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e3-eqiad [19:20:50] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad [19:21:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P51948 and previous config saved to /var/cache/conftool/dbconfig/20230829-192141-ladsgroup.json [19:23:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-eqiad [19:23:10] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f2-eqiad [19:24:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:24:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudservices1006.eqiad.wmnet with OS bullseye [19:24:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye com... [19:25:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f2-eqiad [19:25:29] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f3-eqiad [19:26:42] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1033.eqiad.wmnet [19:26:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Papaul) [19:27:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f3-eqiad [19:27:48] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device ssw1-e1-eqiad [19:27:53] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10Papaul) [19:28:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Papaul) 05Open→03Resolved @aborrero all yours [19:29:14] kimberly_sarabia: generally a good troubleshooting step is to rebase the patch [19:29:48] taavi: thx will try that [19:29:58] also note that 'recheck' will not attempt to retry the gate-and-submit pipeline which would ultimately merge it, you would need to remove and re-apply the +2 [19:30:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-e1-eqiad [19:30:07] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for 
network device ssw1-f1-eqiad [19:31:29] oh gotcha. gtk [19:32:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-f1-eqiad [19:32:27] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device asw2-c2-eqiad [19:32:46] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device asw2-c2-eqiad [19:35:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1033.eqiad.wmnet [19:36:44] (03CR) 10Dmaza: [C: 03+1] wikidiff2: set maxSplitSize = 10 by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952940 (https://phabricator.wikimedia.org/T341754) (owner: 10HMonroy) [19:36:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P51949 and previous config saved to /var/cache/conftool/dbconfig/20230829-193648-ladsgroup.json [19:38:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:43:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:47:28] taavi: that did it. thanks! 
[19:51:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T343718)', diff saved to https://phabricator.wikimedia.org/P51950 and previous config saved to /var/cache/conftool/dbconfig/20230829-195154-ladsgroup.json [19:51:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [19:52:00] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [19:52:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [19:52:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T343718)', diff saved to https://phabricator.wikimedia.org/P51951 and previous config saved to /var/cache/conftool/dbconfig/20230829-195215-ladsgroup.json [19:56:43] (03PS1) 10BryanDavis: Mark Gerrit repo as abandoned [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/953338 (https://phabricator.wikimedia.org/T345182) [19:57:00] (03CR) 10CI reject: [V: 04-1] Mark Gerrit repo as abandoned [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/953338 (https://phabricator.wikimedia.org/T345182) (owner: 10BryanDavis) [19:58:01] (03CR) 10BryanDavis: [V: 03+2 C: 03+2] Mark Gerrit repo as abandoned [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/953338 (https://phabricator.wikimedia.org/T345182) (owner: 10BryanDavis) [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230829T2000). [20:00:06] Dreamy_Jazz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:12] I can deploy today [20:00:24] Dreamy_Jazz: hey! [20:02:25] Dreamy_Jazz: hi! [20:07:08] \o [20:07:11] Apologies for delay [20:07:22] no worries, let's start [20:07:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953315 (https://phabricator.wikimedia.org/T344797) (owner: 10Dreamy Jazz) [20:07:51] (03PS2) 10Urbanecm: clienthints: Raise maxlag for API back to default for group0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953315 (https://phabricator.wikimedia.org/T344797) (owner: 10Dreamy Jazz) [20:07:54] (03CR) 10Urbanecm: [C: 03+2] clienthints: Raise maxlag for API back to default for group0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953315 (https://phabricator.wikimedia.org/T344797) (owner: 10Dreamy Jazz) [20:08:01] (03CR) 10TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953315 (https://phabricator.wikimedia.org/T344797) (owner: 10Dreamy Jazz) [20:08:13] I can't test this patch as it's not going to be feasible to do so (waiting for 30 mins to test for behaviour). It essentially reverts the last change to do with this task. [20:08:25] yup yup, i recall that [20:08:27] so, just sync i presume [20:08:34] (03Merged) 10jenkins-bot: clienthints: Raise maxlag for API back to default for group0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953315 (https://phabricator.wikimedia.org/T344797) (owner: 10Dreamy Jazz) [20:08:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:08:52] If possible, thanks. 
[20:09:03] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:953315|clienthints: Raise maxlag for API back to default for group0 and 1 (T344797)]] [20:09:05] ack, will do [20:09:11] T344797: Decrease CheckUserClientHintsRestApiMaxTimeLag config on production wikis - https://phabricator.wikimedia.org/T344797 [20:10:35] !log urbanecm@deploy1002 urbanecm and dreamyjazz: Backport for [[gerrit:953315|clienthints: Raise maxlag for API back to default for group0 and 1 (T344797)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:10:40] !log urbanecm@deploy1002 urbanecm and dreamyjazz: Continuing with sync [20:10:42] proceeding [20:12:43] I should be able to assert that this patch works once I look at logstash and see fewer entries tomorrow. [20:13:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:14:09] Dreamy_Jazz: ack, makes sense. feel free to schedule a reverting/altering patch if needed. [20:14:33] 👍 [20:16:01] (03CR) 10Volans: "FYI The failing tests are because of pyparsing, I've opened:" [software/cumin] - 10https://gerrit.wikimedia.org/r/953294 (owner: 10Jbond) [20:16:16] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:953315|clienthints: Raise maxlag for API back to default for group0 and 1 (T344797)]] (duration: 07m 13s) [20:16:23] Dreamy_Jazz: synced :) [20:16:25] T344797: Decrease CheckUserClientHintsRestApiMaxTimeLag config on production wikis - https://phabricator.wikimedia.org/T344797 [20:16:25] anything else? [20:16:34] Thanks! Nothing else. [20:16:52] np! 
[20:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:20:31] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/953203 (https://phabricator.wikimedia.org/T341497) (owner: 10Jbond) [20:23:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:25:36] (03CR) 10Volans: "LGTM but I have some questions." [puppet] - 10https://gerrit.wikimedia.org/r/940384 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [20:28:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:31:56] (03CR) 10Volans: "That's neat! Thanks for the patch! Some minor things to adjust/decide inline." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/953295 (owner: 10Jbond) [20:33:02] (03PS1) 10Urbanecm: growthexperiments: Run listTaskCounts for all task types [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) [20:33:16] (03CR) 10Urbanecm: [C: 04-1] "not yet" [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [20:33:48] (03PS2) 10Urbanecm: growthexperiments: Run listTaskCounts for all task types [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) [20:33:54] (03CR) 10Urbanecm: [C: 04-1] "not yet" [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [20:34:42] (03CR) 10CI reject: [V: 04-1] growthexperiments: Run listTaskCounts for all task types [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [20:38:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:40:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T343718)', diff saved to https://phabricator.wikimedia.org/P51952 and previous config saved to /var/cache/conftool/dbconfig/20230829-204039-ladsgroup.json [20:40:46] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [20:43:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:46:38] (KubernetesAPILatency) firing: High 
Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:49:47] (03PS1) 10Urbanecm: alertmanager: route Growth team alerts [puppet] - 10https://gerrit.wikimedia.org/r/953347 (https://phabricator.wikimedia.org/T345202) [20:50:33] (03PS2) 10Urbanecm: alertmanager: route Growth team alerts [puppet] - 10https://gerrit.wikimedia.org/r/953347 (https://phabricator.wikimedia.org/T345202) [20:51:00] (03PS3) 10Urbanecm: alertmanager: route Growth team alerts [puppet] - 10https://gerrit.wikimedia.org/r/953347 (https://phabricator.wikimedia.org/T345202) [20:51:09] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953347 (https://phabricator.wikimedia.org/T345202) (owner: 10Urbanecm) [20:51:34] (03CR) 10Urbanecm: [C: 04-1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [20:51:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:55:31] (03PS1) 10Urbanecm: dumps: Advertise growthmentorship dumps from index.html [puppet] - 10https://gerrit.wikimedia.org/r/953348 [20:55:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P51953 and previous config saved to /var/cache/conftool/dbconfig/20230829-205546-ladsgroup.json [21:08:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:10:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P51954 and previous config saved to /var/cache/conftool/dbconfig/20230829-211052-ladsgroup.json [21:13:22] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:13:33] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:13:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:13:45] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:13:51] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:13:56] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:14:10] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:18:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:19:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2026 hosts in codfw - jhancock@cumin2002" [21:19:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2026 hosts in codfw - jhancock@cumin2002" [21:19:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:20:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:20:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:20:19] 
!log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:20:24] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:21:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:22:06] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:23:12] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1030.eqiad.wmnet [21:23:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:25:04] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:25:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T343718)', diff saved to https://phabricator.wikimedia.org/P51955 and previous config saved to /var/cache/conftool/dbconfig/20230829-212558-ladsgroup.json [21:26:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [21:26:04] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [21:26:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [21:26:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T343718)', diff saved to https://phabricator.wikimedia.org/P51956 and previous config saved to /var/cache/conftool/dbconfig/20230829-212619-ladsgroup.json [21:26:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:26:29] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:27:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:27:52] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:29:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:29:15] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:30:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:30:54] !log 
jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:32:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:33:36] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:35:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:36:31] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host restbase1030.eqiad.wmnet [21:37:08] PROBLEM - cassandra-a SSL 10.64.48.234:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:37:08] PROBLEM - cassandra-a service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:37:21] ^^^ got it [21:37:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2026.mgmt.codfw.wmnet with reboot policy FORCED [21:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [21:38:33] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.234 and port 9042: Connection refused eevans Downtime for maintenance was reset. https://phabricator.wikimedia.org/T93886 [21:38:33] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.234:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Downtime for maintenance was reset. 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:38:33] ACKNOWLEDGEMENT - cassandra-a service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive eevans Downtime for maintenance was reset. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:38:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2027.mgmt.codfw.wmnet with reboot policy FORCED [21:42:09] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Jhancock.wm) [21:48:56] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:49:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2027.mgmt.codfw.wmnet with reboot policy FORCED [21:50:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2026.mgmt.codfw.wmnet with reboot policy FORCED [21:50:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2026.mgmt.codfw.wmnet with reboot policy FORCED [21:52:39] (03CR) 10Cory Massaro: [C: 03+1] "LGTM but I trust others to know better" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [21:53:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2026.mgmt.codfw.wmnet with reboot policy FORCED [22:04:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T343718)', diff saved to https://phabricator.wikimedia.org/P51957 and previous config saved to /var/cache/conftool/dbconfig/20230829-220451-ladsgroup.json 
[22:04:58] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[22:19:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P51958 and previous config saved to /var/cache/conftool/dbconfig/20230829-221958-ladsgroup.json
[22:35:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P51959 and previous config saved to /var/cache/conftool/dbconfig/20230829-223504-ladsgroup.json
[22:40:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2026.mgmt.codfw.wmnet with reboot policy FORCED
[22:50:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T343718)', diff saved to https://phabricator.wikimedia.org/P51960 and previous config saved to /var/cache/conftool/dbconfig/20230829-225010-ladsgroup.json
[22:50:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[22:50:21] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[22:50:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[22:50:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T343718)', diff saved to https://phabricator.wikimedia.org/P51961 and previous config saved to /var/cache/conftool/dbconfig/20230829-225031-ladsgroup.json
[22:50:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2026.mgmt.codfw.wmnet with reboot policy FORCED
[22:52:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2027.mgmt.codfw.wmnet with reboot policy FORCED
[22:59:28] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm)
[23:05:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2027.mgmt.codfw.wmnet with reboot policy FORCED
[23:06:15] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[23:08:11] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2034 hosts in codfw - jhancock@cumin2002"
[23:08:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2034 hosts in codfw - jhancock@cumin2002"
[23:08:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:12:31] SRE: Switch visibility level to Internal for GitLab repositories - https://phabricator.wikimedia.org/T345215 (ppenloglou)
[23:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T343718)', diff saved to https://phabricator.wikimedia.org/P51962 and previous config saved to /var/cache/conftool/dbconfig/20230829-233549-ladsgroup.json
[23:35:56] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[23:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P51963 and previous config saved to /var/cache/conftool/dbconfig/20230829-235055-ladsgroup.json