[00:02:45] (03PS1) 10Gergő Tisza: Disable more extensions when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094071 (https://phabricator.wikimedia.org/T373737) [00:06:32] (03PS1) 10Ryan Kemper: wdqs-internal: bring graph split into production [puppet] - 10https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555) [00:11:00] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-etcd2004.codfw.wmnet with OS bookworm [00:11:00] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd2004.codfw.wmnet [00:11:05] 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988#10347008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-etcd2004.codfw.wmnet with OS bookworm completed: - aux-k8s-etcd2004 (**PASS**) - R... [00:11:25] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd2005.codfw.wmnet [00:11:26] !log herron@cumin1002 START - Cookbook sre.dns.netbox [00:16:53] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd2005.codfw.wmnet - herron@cumin1002" [00:20:06] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd2005.codfw.wmnet - herron@cumin1002" [00:20:06] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:20:07] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd2005.codfw.wmnet on all recursors [00:20:10] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd2005.codfw.wmnet on all recursors [00:20:36] !log herron@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd2005.codfw.wmnet - herron@cumin1002" [00:20:40] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd2005.codfw.wmnet - herron@cumin1002" [00:25:48] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:25:53] PROBLEM - BFD status on cloudsw1-e4-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:27:47] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-etcd2005.codfw.wmnet with OS bookworm [00:27:55] 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988#10347033 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-etcd2005.codfw.wmnet with OS bookworm [00:32:05] FIRING: [2x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:38:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1094080 [00:38:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1094080 (owner: 10TrainBranchBot) [00:42:55] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-etcd2005.codfw.wmnet with reason: host reimage [00:46:32] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-etcd2005.codfw.wmnet with reason: host reimage 
[01:00:02] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-etcd2005.codfw.wmnet with OS bookworm [01:00:02] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd2005.codfw.wmnet [01:00:13] 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988#10347071 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-etcd2005.codfw.wmnet with OS bookworm completed: - aux-k8s-etcd2005 (**PASS**) - R... [01:00:46] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10347072 (10Papaul) @Jhancock.wm it did send the request again to puppetmaster1001. i will ping IF tomorrow to see what's going on. why the puppet reque... [01:04:39] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182#10347076 (10phaultfinder) [01:08:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1094083 [01:08:23] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1094083 (owner: 10TrainBranchBot) [01:11:44] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1094080 (owner: 10TrainBranchBot) [01:15:40] (03CR) 10BryanDavis: role::beta::deploymentserver: Populate docker group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [01:16:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:42:03] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1094083 (owner: 10TrainBranchBot) [02:02:20] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/5a87c472d1bdb5db886fe378e3325898a5fa846675dba32bacbf3c9c596c5234/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:04:21] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10347102 (10Platonides) I just found a weird behavior. Just by //reloading// the page, I get different sets of held messages. It mostly shows me... [02:22:20] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:44:02] (03CR) 10RLazarus: "Thanks for looping me in! No objections from the mwscript side." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:32:05] FIRING: [4x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:35:24] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1093890 (owner: 10L10n-bot) [03:46:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:47:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:00:55] FIRING: MaxConntrack: Max conntrack at 99.04% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [04:01:34] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [04:04:34] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 99 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [04:06:34] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 34 % full 
https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [04:10:50] (03PS1) 10Tim Starling: Move default main page text for new wikis to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) [04:10:55] RESOLVED: MaxConntrack: Max conntrack at 96.32% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [04:11:31] (03CR) 10CI reject: [V:04-1] Move default main page text for new wikis to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) (owner: 10Tim Starling) [04:37:36] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [04:41:16] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [04:51:39] (03PS3) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1089939 [05:16:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:38:06] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 71164MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [05:47:43] (03PS2) 10Tim Starling: Move default main page text for new wikis to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241122T0700) [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:24:48] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 80069MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [07:26:18] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:31:16] (03CR) 10Ayounsi: "Sounds good!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [07:32:05] FIRING: [4x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:37:12] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10347297 (10MoritzMuehlenhoff) How are we planning to handle removing the servers on our side? I think we should run the decom cookb... [07:48:53] (03CR) 10JMeybohm: [C:03+1] mediawiki: support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081449 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241122T0800) [08:06:31] (03PS1) 10Giuseppe Lavagetto: New version deployment [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1094283 [08:06:42] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] New version deployment [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1094283 (owner: 10Giuseppe Lavagetto) [08:07:33] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add sorting options to tree view - oblivian@cumin1002" [08:07:37] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add sorting options to tree view - oblivian@cumin1002 [08:08:12] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add sorting options to tree view - oblivian@cumin1002 [08:08:13] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add sorting options to tree view - oblivian@cumin1002" [08:09:33] running out of space on centrallog syslogs [08:10:12] (03PS1) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [08:11:25] (03CR) 10CI reject: [V:04-1] WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [08:15:04] (03PS2) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [08:16:19] (03CR) 10CI reject: [V:04-1] WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [08:16:28] should I increase the lvs space or remove older logs? [08:17:52] there was a large increase since yesterday at 17:40 [08:23:57] (03PS1) 10Muehlenhoff: Extends MOU for Bob West [puppet] - 10https://gerrit.wikimedia.org/r/1094289 [08:24:50] (03PS1) 10Ayounsi: Expose _gql_execute to wmf-netbox [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 [08:25:56] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1094289 (owner: 10Muehlenhoff) [08:27:29] (03CR) 10Muehlenhoff: "Why don't we just make a proper mapnik deb? It seems fairly straightforward, I can look into it next week?" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [08:27:42] (03CR) 10Muehlenhoff: [C:03+2] Extends MOU for Bob West [puppet] - 10https://gerrit.wikimedia.org/r/1094289 (owner: 10Muehlenhoff) [08:33:13] (03CR) 10LSobanski: Filter out addresses that cannot be removed from VRTS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034046 (https://phabricator.wikimedia.org/T284145) (owner: 10LSobanski) [08:33:40] (03CR) 10Muehlenhoff: [C:03+2] snapshot: Update Cumin alias with dumper_fillin_wd role [puppet] - 10https://gerrit.wikimedia.org/r/1092861 (owner: 10Muehlenhoff) [08:40:15] (03CR) 10Alexandros Kosiaris: [C:03+1] mediawiki: support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081449 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [08:40:54] (03CR) 10Alexandros Kosiaris: [C:03+1] hieradata: add "migration" release of mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1081451 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [08:41:07] (03CR) 10Alexandros Kosiaris: [C:03+1] hieradata: add remaining "migration" releases [puppet] - 10https://gerrit.wikimedia.org/r/1082865 
(https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [08:50:13] (03CR) 10Brouberol: "I have some questions about the chosen endpoints." [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [08:50:43] (03CR) 10Brouberol: [C:03+1] wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) (owner: 10Ryan Kemper) [08:51:43] (03CR) 10Brouberol: [C:03+1] wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [08:52:12] (03CR) 10Brouberol: [C:03+1] wdqs-internal: configure graphsplit load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [08:52:24] (03CR) 10Brouberol: [C:03+1] wdqs-internal: bring graph split into production [puppet] - 10https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [09:07:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1093953 (https://phabricator.wikimedia.org/T380476) (owner: 10Jelto) [09:11:52] (03PS3) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [09:15:38] (03CR) 10Muehlenhoff: "One comment inline, you also need to add "vopsbot" to the "Update:" list in files/distributions-wikimedia for the target distro." 
[puppet] - 10https://gerrit.wikimedia.org/r/1093875 (owner: 10Giuseppe Lavagetto) [09:15:56] (03PS3) 10Jcrespo: mediabackup: Setup backup1010 as the 6th media backup host in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1093377 (https://phabricator.wikimedia.org/T376892) [09:16:00] (03CR) 10Jcrespo: [C:03+2] mediabackup: Setup backup1010 as the 6th media backup host in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1093377 (https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo) [09:16:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:17:04] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 74150MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [09:25:56] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 9 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:26:47] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10347376 (10WMDECyn) approved from WMDE end [09:26:56] RECOVERY - BFD status on cloudsw1-e4-eqiad.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:31:25] (03CR) 10Arnaudb: "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1093884 (https://phabricator.wikimedia.org/T370452) (owner: 10MVernon) [09:32:35] (03CR) 10Arnaudb: [C:03+1] thanos: storage schema for larger disks_by_path backends, add 2 [puppet] - 10https://gerrit.wikimedia.org/r/1093885 (https://phabricator.wikimedia.org/T370452) (owner: 10MVernon) [09:32:47] (03CR) 10Arnaudb: [C:03+1] thanos: add new backends to profile::thanos::swift::backends [puppet] - 10https://gerrit.wikimedia.org/r/1093884 (https://phabricator.wikimedia.org/T370452) (owner: 10MVernon) [09:34:25] (03CR) 10Jcrespo: [C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1093377 (https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo) [09:35:14] (03PS3) 10Jcrespo: mediabackup: Setup backup2010 as the 6th media backup host in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093379 (https://phabricator.wikimedia.org/T376892) [09:44:30] (03CR) 10MVernon: [C:03+2] thanos: add new backends to profile::thanos::swift::backends [puppet] - 10https://gerrit.wikimedia.org/r/1093884 (https://phabricator.wikimedia.org/T370452) (owner: 10MVernon) [09:44:33] (03CR) 10MVernon: [C:03+2] thanos: storage schema for larger disks_by_path backends, add 2 [puppet] - 10https://gerrit.wikimedia.org/r/1093885 (https://phabricator.wikimedia.org/T370452) (owner: 10MVernon) [09:45:02] (03CR) 10Elukey: "Could be an option yes, but we do something similar for other projects as well in this repo to avoid the Debian packaging, I thought it wa" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [09:45:52] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10347411 (10elukey) [09:46:18] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:48:24] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10347412 (10elukey) 05Open→03Resolved a:03elukey The host is fully in service now and I had a chat with Matthew to put it in production, resol... [09:51:25] (03CR) 10Elukey: [C:03+1] puppetboard: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1093873 (owner: 10Muehlenhoff) [09:52:30] (03PS7) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) [09:53:28] (03CR) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [09:55:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:57:09] (03CR) 10Elukey: "In any case, I'd need to rework this image, probably to use nodejs20-slim as final base image with the compiled mapnik libraries only. I t" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [09:57:30] 10ops-eqiad, 06SRE, 10Cloud-Services, 06DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10347431 (10cmooney) This port bounced again overnight: ` cmooney@cloudsw1-d5-eqiad> show log messages.1.gz | match "10.64.... 
[09:58:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10347436 (10akosiaris) Cool thanks, I 'll take over this one. [10:00:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:03:56] (03CR) 10Ladsgroup: [C:03+1] Move default main page text for new wikis to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) (owner: 10Tim Starling) [10:09:46] (03CR) 10Elukey: [C:04-1] "Let's see how/if a debian package works, it may be handy to just install it via blubber on a nodejs20 vanilla image." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [10:10:01] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1011.eqiad.wmnet [10:13:17] (03CR) 10Jcrespo: [C:03+2] mediabackup: Setup backup2010 as the 6th media backup host in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093379 (https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo) [10:13:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10347522 (10MoritzMuehlenhoff) [10:13:56] (03PS1) 10Vgutierrez: hieradata,haproxykafka: Disable haproxykafka globally [puppet] - 10https://gerrit.wikimedia.org/r/1094376 (https://phabricator.wikimedia.org/T380570) [10:14:57] (03PS2) 10Vgutierrez: hieradata,haproxykafka: Disable haproxykafka globally [puppet] - 10https://gerrit.wikimedia.org/r/1094376 
(https://phabricator.wikimedia.org/T380570) [10:15:24] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094376 (https://phabricator.wikimedia.org/T380570) (owner: 10Vgutierrez) [10:16:01] (03CR) 10Jcrespo: [C:03+1] "+1 in that this looks syntactically correct, but must confess I have no idea what it does." [puppet] - 10https://gerrit.wikimedia.org/r/1094376 (https://phabricator.wikimedia.org/T380570) (owner: 10Vgutierrez) [10:16:21] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:17:34] (03CR) 10Vgutierrez: [C:03+2] hieradata,haproxykafka: Disable haproxykafka globally [puppet] - 10https://gerrit.wikimedia.org/r/1094376 (https://phabricator.wikimedia.org/T380570) (owner: 10Vgutierrez) [10:21:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2159.codfw.wmnet with OS bookworm [10:22:13] !log manually stopping haproxykafka on A:cp-ulsfo and A:cp-eqsin - T380570 [10:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:17] T380570: centrallog1002, centrallog2002 running out of disk space - https://phabricator.wikimedia.org/T380570 [10:22:36] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1011.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:22:59] vgutierrez: you just made things 200x better, thanks vgutierrez [10:23:22] it took effect already, at least for cp5 hosts [10:23:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1011.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:23:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:23:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1011.eqiad.wmnet 
[10:24:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:26:04] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1014.eqiad.wmnet [10:27:24] (03PS4) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [10:27:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:28:33] (03PS5) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [10:31:58] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:32:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:28] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10347563 (10Clement_Goubert) Re-imaging because I accidentally overwrote the partition table on the good disk with the partition table on the new disk... 
[10:36:18] (03PS1) 10Muehlenhoff: Update site.pp after decom [puppet] - 10https://gerrit.wikimedia.org/r/1094380 (https://phabricator.wikimedia.org/T380564) [10:37:00] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:37:05] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [10:37:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:37:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:37:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1014.eqiad.wmnet [10:40:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2159.codfw.wmnet with reason: host reimage [10:41:27] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [10:43:11] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [10:43:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2159.codfw.wmnet with reason: host reimage [10:43:26] ^ all good now [10:43:46] (03CR) 10Muehlenhoff: [C:03+2] Update site.pp after decom [puppet] - 10https://gerrit.wikimedia.org/r/1094380 
(https://phabricator.wikimedia.org/T380564) (owner: 10Muehlenhoff) [10:45:52] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission ganeti1011 / ganeti1014 - https://phabricator.wikimedia.org/T380564#10347608 (10MoritzMuehlenhoff) [10:47:15] (03CR) 10Muehlenhoff: "We don't strictly need to block this FWIW. It was mostly a side comment, you could also proceed with the original approach and then we swa" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [10:48:08] (03PS6) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [10:48:35] (03CR) 10Elukey: [C:04-1] "Nono you got me thinking, since to make things right I'd need to copy the binaries to a nodejs20-slim image, that will be used as baseline" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [10:49:00] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [10:49:22] (03CR) 10CI reject: [V:04-1] WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [10:49:22] (03CR) 10Majavah: "minor thing inline, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [10:52:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:54:45] RESOLVED: WidespreadPuppetFailure: 
Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:56:30] (03PS1) 10Clément Goubert: wikikube: Default to containerd partition layout [puppet] - 10https://gerrit.wikimedia.org/r/1094383 (https://phabricator.wikimedia.org/T362408) [10:57:07] (03PS7) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [10:58:05] (03PS1) 10Clément Goubert: wikikube: Add wikikube-worker13[13-28] [puppet] - 10https://gerrit.wikimedia.org/r/1094381 (https://phabricator.wikimedia.org/T380350) [11:02:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2159.codfw.wmnet with OS bookworm [11:04:58] !log homer 'lsw1-b7-codfw*' commit 'T377028' [11:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:03] T377028: wikikube-worker21[36-55] implementation tracking - https://phabricator.wikimedia.org/T377028 [11:06:53] (03PS1) 10Vgutierrez: Revert^2 "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1094384 [11:07:45] (03CR) 10JMeybohm: wikikube: Default to containerd partition layout (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1094383 (https://phabricator.wikimedia.org/T362408) (owner: 10Clément Goubert) [11:07:48] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2140.codfw.wmnet [11:07:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2140.codfw.wmnet [11:09:48] (03CR) 10Ayounsi: WIP: example config for Nokia SR-Linux (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1084107 
(https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [11:10:32] (03CR) 10JMeybohm: [C:03+1] sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [11:14:07] (03PS2) 10Clément Goubert: wikikube: Default to containerd partition layout [puppet] - 10https://gerrit.wikimedia.org/r/1094383 (https://phabricator.wikimedia.org/T362408) [11:14:36] (03CR) 10Clément Goubert: wikikube: Default to containerd partition layout (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1094383 (https://phabricator.wikimedia.org/T362408) (owner: 10Clément Goubert) [11:17:01] (03PS2) 10Vgutierrez: Revert^2 "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1094384 [11:18:20] (03PS1) 10Muehlenhoff: package_builder: Cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1094387 (https://phabricator.wikimedia.org/T379343) [11:18:25] !log homer 'lsw1-b4-codfw*' commit 'T376966' [11:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:30] T376966: wikikube-worker21[56-70] implementation tracking - https://phabricator.wikimedia.org/T376966 [11:19:01] (03CR) 10CI reject: [V:04-1] package_builder: Cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1094387 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [11:19:02] !log homer 'lsw1-b7-codfw*' commit 'T376966' [11:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:47] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10347731 (10ayounsi) [11:19:47] !log homer 'lsw1-c2-codfw*' commit 'T376966' [11:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:32] !log homer 'lsw1-c4-codfw*' commit 'T376966' [11:20:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:37] (03PS2) 10Muehlenhoff: package_builder: Cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1094387 (https://phabricator.wikimedia.org/T379343) [11:20:58] (03PS3) 10Vgutierrez: Revert^2 "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1094384 [11:21:35] !log homer 'lsw1-c7-codfw*' commit 'T376966' [11:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:10] 06SRE, 10Data-Persistence-Backup, 10media-backups, 13Patch-For-Review: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10347738 (10jcrespo) Capacity reached 94.2% and finally it is on a downward trend: 93.7% 🎉 [11:22:32] !log homer 'lsw1-d1-codfw*' commit 'T376966' [11:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:20] !log homer 'lsw1-d4-codfw*' commit 'T376966' [11:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:05] !log homer 'lsw1-d5-codfw*' commit 'T376966' [11:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:08] T376966: wikikube-worker21[56-70] implementation tracking - https://phabricator.wikimedia.org/T376966 [11:24:42] !log homer 'lsw1-d6-codfw*' commit 'T376966' [11:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:23] !log homer 'lsw1-d7-codfw*' commit 'T376966' [11:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:21] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2156-2170].codfw.wmnet [11:26:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2156-2170].codfw.wmnet [11:31:59] (03CR) 10Vgutierrez: Revert^2 "haproxykafka: working on TLS client authentication to kafka" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1094384 (owner: 10Vgutierrez) [11:32:05] FIRING: [4x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:33:39] 06SRE, 10Data-Persistence-Backup, 10media-backups, 13Patch-For-Review: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10347768 (10jcrespo) Timestamp is in CET: ` [12:26:36] RESOLVED: DiskSpace: Disk space backup2011:9100:/srv/obj... [11:39:27] (03CR) 10Elukey: Revert^2 "haproxykafka: working on TLS client authentication to kafka" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1094384 (owner: 10Vgutierrez) [11:41:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241122T0800) [12:00:05] eoghan, jelto, arnoldokoth, and mutante: That opportune time for a GitLab version upgrades deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241122T1200). 
[12:06:54] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1097723344 and 49 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:07:25] (03PS4) 10Vgutierrez: Revert^2 "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1094384 [12:07:36] (03CR) 10Vgutierrez: Revert^2 "haproxykafka: working on TLS client authentication to kafka" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1094384 (owner: 10Vgutierrez) [12:08:54] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1544 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:10:00] (03CR) 10CI reject: [V:04-1] Revert^2 "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1094384 (owner: 10Vgutierrez) [12:11:15] (03PS5) 10Vgutierrez: Revert^2 "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1094384 [12:14:37] (03PS1) 10Vgutierrez: Revert "hieradata,haproxykafka: Disable haproxykafka globally" [puppet] - 10https://gerrit.wikimedia.org/r/1094392 [12:15:21] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094392 (owner: 10Vgutierrez) [12:17:45] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 5 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1093953 (https://phabricator.wikimedia.org/T380476) (owner: 10Jelto) [12:17:49] (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1094387 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [12:19:43] (03CR) 10Elukey: [C:03+1] "LGTM! 
An alternative path would be to have a specific flag to turn off/on mTLS to have a more incremental rollout, but it is also fine to " [puppet] - 10https://gerrit.wikimedia.org/r/1094384 (owner: 10Vgutierrez) [12:25:51] (03PS1) 10Muehlenhoff: Make docker::baseimages ensurable [puppet] - 10https://gerrit.wikimedia.org/r/1094393 (https://phabricator.wikimedia.org/T379343) [12:26:29] (03CR) 10CI reject: [V:04-1] Make docker::baseimages ensurable [puppet] - 10https://gerrit.wikimedia.org/r/1094393 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [12:32:32] (03PS2) 10Muehlenhoff: Make docker::baseimages ensurable [puppet] - 10https://gerrit.wikimedia.org/r/1094393 (https://phabricator.wikimedia.org/T379343) [12:33:09] (03CR) 10CI reject: [V:04-1] Make docker::baseimages ensurable [puppet] - 10https://gerrit.wikimedia.org/r/1094393 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [12:35:00] (03PS6) 10Vgutierrez: Revert^2 "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1094384 [12:35:00] (03PS2) 10Vgutierrez: Revert "hieradata,haproxykafka: Disable haproxykafka globally" [puppet] - 10https://gerrit.wikimedia.org/r/1094392 [12:35:52] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094392 (owner: 10Vgutierrez) [12:38:52] (03CR) 10Jelto: [V:03+1] "I'll merge this on Monday because this could affect a large number of hosts (although the diff looks good and the only change is on GitLab" [puppet] - 10https://gerrit.wikimedia.org/r/1093953 (https://phabricator.wikimedia.org/T380476) (owner: 10Jelto) [12:38:57] (03PS3) 10Muehlenhoff: Make docker::baseimages ensurable [puppet] - 10https://gerrit.wikimedia.org/r/1094393 (https://phabricator.wikimedia.org/T379343) [12:45:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094393 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [12:45:26] (03CR) 
10Cathal Mooney: "Having gone through it all looks ok to me! I'll discuss with volans about I9eecb7db849535f2e09d0d03b5843004e937cafb and see if we can get" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [12:45:51] (03CR) 10Muehlenhoff: [C:03+2] package_builder: Cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1094387 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [12:59:40] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182#10347960 (10phaultfinder) [13:16:07] (03PS1) 10Muehlenhoff: turnilo: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1094420 [13:16:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:19:58] (03PS1) 10Dbrant: Revert "push-notifications: Bump image to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094421 [13:29:05] (03PS1) 10Stevemunene: Enable pod-scoped "external services" network policies for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) [13:29:10] FIRING: [16x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094420 (owner: 10Muehlenhoff) [13:37:05] FIRING: [18x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:05] FIRING: [18x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:46:40] (03PS1) 10Muehlenhoff: Switch idp-test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1094426 [13:55:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094426 (owner: 10Muehlenhoff) [14:02:18] (03PS1) 10Muehlenhoff: Deprecate system::role for Openstack roles [puppet] - 10https://gerrit.wikimedia.org/r/1094434 [14:04:47] (03CR) 10Elukey: [C:03+1] Make docker::baseimages ensurable [puppet] - 10https://gerrit.wikimedia.org/r/1094393 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [14:04:58] (03PS1) 10Brouberol: postgresql-airflow-analytics-test: add helmfile and configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094435 (https://phabricator.wikimedia.org/T380591) [14:05:00] (03PS1) 10Brouberol: airflow-analytics-test: use the cloudnative PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094436 (https://phabricator.wikimedia.org/T380591) [14:08:42] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] Revert "push-notifications: Bump image to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094421 (owner: 10Dbrant) [14:08:56] (03CR) 10Dbrant: [C:03+2] Revert "push-notifications: Bump image to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094421 (owner: 10Dbrant) [14:10:00] (03Merged) 10jenkins-bot: Revert "push-notifications: Bump image to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094421 (owner: 10Dbrant) [14:11:30] (03CR) 10Vgutierrez: [C:03+2] 
Revert^2 "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1094384 (owner: 10Vgutierrez) [14:12:22] !log ihurbain@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [14:12:55] !log ihurbain@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [14:13:12] !log ihurbain@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [14:15:40] (03PS1) 10Muehlenhoff: Deprecate system::role for more Data Engineering roles [puppet] - 10https://gerrit.wikimedia.org/r/1094448 [14:18:19] (03CR) 10Vgutierrez: [C:03+2] Revert "hieradata,haproxykafka: Disable haproxykafka globally" [puppet] - 10https://gerrit.wikimedia.org/r/1094392 (owner: 10Vgutierrez) [14:19:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10348201 (10phaultfinder) [14:21:03] (03CR) 10Jforrester: Move default main page text for new wikis to config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) (owner: 10Tim Starling) [14:22:18] (03PS1) 10Ilias Sarantopoulos: admin/data.yaml: Add bearloga to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1094454 (https://phabricator.wikimedia.org/T380593) [14:22:30] !log restoring haproxykafka on A:cp-ulsfo and A:cp-eqsin - T380570 [14:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:34] T380570: centrallog1002, centrallog2002 running out of disk space - https://phabricator.wikimedia.org/T380570 [14:23:43] !log ihurbain@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [14:26:23] (03CR) 10Ilias Sarantopoulos: "@tklausmann@wikimedia.org Am I missing something, or would this be sufficient?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1094454 (https://phabricator.wikimedia.org/T380593) (owner: 10Ilias Sarantopoulos) [14:26:45] (03CR) 10Klausman: [C:03+1] admin/data.yaml: Add bearloga to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1094454 (https://phabricator.wikimedia.org/T380593) (owner: 10Ilias Sarantopoulos) [14:27:29] !log ihurbain@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [14:30:13] (03CR) 10Tiziano Fogli: [C:03+2] "Yes, I agree with you that we already have a filter for exported_cluster in place, but the additional filter changes the way Prometheus op" [alerts] - 10https://gerrit.wikimedia.org/r/1093302 (https://phabricator.wikimedia.org/T374178) (owner: 10Tiziano Fogli) [14:31:48] (03Merged) 10jenkins-bot: opensearch: reduce noise of PrometheusRuleEvaluationFailures [alerts] - 10https://gerrit.wikimedia.org/r/1093302 (https://phabricator.wikimedia.org/T374178) (owner: 10Tiziano Fogli) [14:31:52] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10348245 (10RobH) We were discussing this last week, and brainstormed some on https://etherpad.wikimedia.org/p/magru_server_swaps fr... 
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:08] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1252462912 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:37:37] !log ihurbain@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [14:39:05] (03CR) 10JMeybohm: mw-api-int: add migration release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [14:40:08] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 37855624 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:41:49] (03CR) 10JMeybohm: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1094383 (https://phabricator.wikimedia.org/T362408) (owner: 10Clément Goubert) [14:43:28] (03PS20) 10Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [14:44:08] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:44:39] (03CR) 10Btullis: [C:03+1] "Looks good to me. Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1093366 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [14:44:56] (03CR) 10Brouberol: [C:03+2] airflow: allow multiple DAG folders to be pulled in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093366 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [14:45:05] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [14:45:50] (03CR) 10Klausman: [C:03+2] admin/data.yaml: Add bearloga to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1094454 (https://phabricator.wikimedia.org/T380593) (owner: 10Ilias Sarantopoulos) [14:46:28] (03Merged) 10jenkins-bot: airflow: allow multiple DAG folders to be pulled in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093366 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [14:46:35] (03Merged) 10jenkins-bot: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [14:46:38] (03CR) 10Klausman: [C:03+2] "This should be enough." [puppet] - 10https://gerrit.wikimedia.org/r/1094454 (https://phabricator.wikimedia.org/T380593) (owner: 10Ilias Sarantopoulos) [14:47:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [14:48:59] (03CR) 10Muehlenhoff: [C:03+1] "I can't foresee a specific security risk. 
Some applications might behave differently if PTYs are available compared to the previous mode o" [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) (owner: 10FNegri) [14:49:03] (03CR) 10Bking: [C:03+2] wdqs: create wdqs-internal-[main,scholarly] roles (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [14:49:58] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse[2001-2020].codfw.wmnet [14:51:58] (03CR) 10JMeybohm: [C:03+1] wikikube: Add wikikube-worker13[13-28] [puppet] - 10https://gerrit.wikimedia.org/r/1094381 (https://phabricator.wikimedia.org/T380350) (owner: 10Clément Goubert) [14:53:39] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2022.codfw.wmnet with reason: Decommissioning — T380236 [14:53:43] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2022.codfw.wmnet with reason: Decommissioning — T380236 [14:53:44] T380236: Refresh restbase202[1-3] w/ restbase203[6-8] - https://phabricator.wikimedia.org/T380236 [14:54:27] !log decommissioning Cassandra/restbase2022-{a,b,c} — [14:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:35] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10348362 (10MoritzMuehlenhoff) Ack, thanks. Either is fine with me, I can also switch them to insetup and then keep them running. 
[14:58:27] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2026:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:59:25] FIRING: [8x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:01:24] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wdqs[2026-2027].codfw.wmnet with reason: T379023 [15:01:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wdqs[2026-2027].codfw.wmnet with reason: T379023 [15:01:43] T379023: Create WDQS split endpoints for internal traffic and reconfigure clients to use the new endpoints - https://phabricator.wikimedia.org/T379023 [15:01:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:02:15] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wdqs[2018-2020].codfw.wmnet with reason: T379023 [15:02:34] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wdqs[2018-2020].codfw.wmnet with reason: T379023 [15:04:25] (03PS1) 10Jelto: wikidata-query-gui: bump images for gui and builder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094465 (https://phabricator.wikimedia.org/T350793) [15:05:39] (03PS1) 10Clément Goubert: wikikube: Decommission parse20[01-20] [puppet] - 10https://gerrit.wikimedia.org/r/1094466 
(https://phabricator.wikimedia.org/T380473) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:47] (03CR) 10JHathaway: [C:03+1] sre.hosts.{dhcp,reimage}: force tftp as default option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:08:28] (03CR) 10JMeybohm: [C:03+1] wikikube: Decommission parse20[01-20] [puppet] - 10https://gerrit.wikimedia.org/r/1094466 (https://phabricator.wikimedia.org/T380473) (owner: 10Clément Goubert) [15:09:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10348464 (10phaultfinder) [15:09:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse[2001-2020].codfw.wmnet [15:09:49] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1094426 (owner: 10Muehlenhoff) [15:09:59] (03PS1) 10Bking: wdqs internal endpoints: do not run wdqs-categories [puppet] - 10https://gerrit.wikimedia.org/r/1094468 (https://phabricator.wikimedia.org/T379329) [15:10:03] (03CR) 10Clément Goubert: [C:03+2] wikikube: Decommission parse20[01-20] [puppet] - 10https://gerrit.wikimedia.org/r/1094466 (https://phabricator.wikimedia.org/T380473) (owner: 10Clément Goubert) [15:10:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094468 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [15:10:39] (03CR) 10Scott French: mediawiki: add mercurius features (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:11:47] !log parse[2001-2020].codfw.wmnet 'disable-puppet "decom"' - T380473 [15:11:51] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:51] T380473: Decommission parse20[01-20] - https://phabricator.wikimedia.org/T380473 [15:12:03] !log parse[2001-2020].codfw.wmnet 'systemctl stop kubelet.service' - T380473 [15:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:00] !log kubectl delete node parse20{01..20}.codfw.wmnet - T380473 [15:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:14] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [15:16:00] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [15:16:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:17:05] !log ihurbain@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [15:17:48] !log ihurbain@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [15:20:33] !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts parse2001.codfw.wmnet [15:22:47] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host es2041.codfw.wmnet with OS bookworm [15:22:51] (03CR) 10Brouberol: [C:03+1] wdqs internal endpoints: do not run wdqs-categories [puppet] - 10https://gerrit.wikimedia.org/r/1094468 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [15:23:44] (03CR) 10Bking: [C:03+2] wdqs internal endpoints: do not run wdqs-categories [puppet] - 10https://gerrit.wikimedia.org/r/1094468 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [15:24:21] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: 
Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:22] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:44] (03PS1) 10Clément Goubert: wikikube: Decommission parse20[01-20].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1094470 (https://phabricator.wikimedia.org/T380473) [15:24:52] (03CR) 10Krinkle: [C:03+1] Disable more extensions when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094071 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [15:25:16] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:27:00] (03CR) 10CI reject: [V:04-1] wikikube: Decommission parse20[01-20].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1094470 (https://phabricator.wikimedia.org/T380473) (owner: 10Clément Goubert) [15:28:12] (03PS2) 10Clément Goubert: wikikube: Decommission parse20[01-20].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1094470 (https://phabricator.wikimedia.org/T380473) [15:29:19] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2041.codfw.wmnet with OS bookworm [15:29:25] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: parse2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [15:29:26] 06SRE, 10Wikimedia-Mailing-lists, 07Chinese-Sites: Mailing list for zhwiki arbcom - https://phabricator.wikimedia.org/T380109#10348570 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists/wikipedia-zh-arbcom.lists.wikimedia.org Creating wikis is a much bigger wo... 
[15:29:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: parse2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [15:29:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts parse2001.codfw.wmnet [15:31:06] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host es2041.codfw.wmnet with OS bookworm [15:31:21] !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts parse[2002-2020].codfw.wmnet [15:31:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:33:58] (03CR) 10Tiziano Fogli: "I looked into the code, and what you have is a check for wan.cloudgw.eqiad1.wikimediacloud.org (using this as an example) from both codfw " [puppet] - 10https://gerrit.wikimedia.org/r/1079531 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [15:35:07] (03PS12) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [15:36:54] (03CR) 10Bking: [C:03+2] wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) (owner: 10Ryan Kemper) [15:36:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:37:55] BGP alerts are due to my decom's, I'll run homer once they're done [15:44:18] (03CR) 10Giuseppe Lavagetto: aptrepo: add import for vopsbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093875 (owner: 10Giuseppe Lavagetto) [15:47:46] (03PS7) 10Andrew Bogott: Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) [15:47:50] (03PS3) 10Andrew Bogott: Neutron: remove linuxbridge from mechanism_drivers [puppet] - 10https://gerrit.wikimedia.org/r/1092425 (https://phabricator.wikimedia.org/T326373) [15:47:54] (03PS1) 10Andrew Bogott: neutron.conf: remove [experimental] linuxbridge section [puppet] - 10https://gerrit.wikimedia.org/r/1094471 (https://phabricator.wikimedia.org/T326373) [15:49:44] (03CR) 10Andrew Bogott: Remove support for neutron linuxbridge driver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [15:49:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094471 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [15:54:27] FIRING: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10348906 (10phaultfinder) [15:56:27] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092416 
(https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [15:56:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:27] RESOLVED: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:37] (03PS6) 10Cathal Mooney: Potential script to assign fr-tech server IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [15:59:59] (03CR) 10Cathal Mooney: Potential script to assign fr-tech server IPs and switch ports (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [16:00:39] (03CR) 10Majavah: [C:03+1] Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [16:00:45] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2041.codfw.wmnet with OS bookworm [16:03:54] (03PS2) 10Giuseppe Lavagetto: aptrepo: add import for vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/1093875 [16:05:24] !log bking@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: 0.3.150 [16:07:15] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [16:08:24] !log bking@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: 0.3.150 (duration: 03m 00s) [16:09:11] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage 
for host es2041.codfw.wmnet with OS bookworm [16:10:40] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: parse[2002-2020].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [16:10:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: parse[2002-2020].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [16:10:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:00] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts parse[2002-2020].codfw.wmnet [16:11:46] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1091732 (https://phabricator.wikimedia.org/T380057) (owner: 10Majavah) [16:12:00] !log homer 'cr*codfw*' commit 'T380473' [16:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:04] T380473: Decommission parse20[01-20] - https://phabricator.wikimedia.org/T380473 [16:13:33] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T380057) (owner: 10Majavah) [16:14:06] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1091849 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [16:14:46] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [16:15:22] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1091848 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [16:16:01] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [16:16:02] (03CR) 10Majavah: [C:03+2] keepalived: Split failover config template to new class [puppet] - 10https://gerrit.wikimedia.org/r/1091732 (https://phabricator.wikimedia.org/T380057) (owner: 10Majavah) [16:16:13] (03CR) 10Majavah: [C:03+2] keepalived::failover: Support IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T380057) (owner: 10Majavah) [16:16:32] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1159305256 and 48 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:17:36] (03CR) 10Scott French: "Thanks, Janis!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [16:17:50] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 246, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:17] 06SRE, 10Data-Persistence-Backup, 10media-backups, 13Patch-For-Review: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10349001 (10jcrespo) 05Open→03Resolved This is now done. Catchup and purging is ongoing, but after that finishes, w... 
[16:19:20] (03CR) 10Clément Goubert: [C:03+2] wikikube: Decommission parse20[01-20].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1094470 (https://phabricator.wikimedia.org/T380473) (owner: 10Clément Goubert) [16:19:32] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 66272 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:23:34] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission parse20[01-20] - https://phabricator.wikimedia.org/T380473#10349027 (10Clement_Goubert) [16:24:15] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2041.codfw.wmnet with reason: host reimage [16:24:52] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 330, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:25:34] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 135346776 and 50 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:26:18] (03CR) 10Andrew Bogott: [C:03+2] Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [16:26:25] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1094434 (owner: 10Muehlenhoff) [16:26:34] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 49312 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:26:51] 07Puppet, 07IPv6, 13Patch-For-Review: Keepalived Puppet module: Support IPv6 - https://phabricator.wikimedia.org/T380057#10349037 (10taavi) 05Open→03Resolved [16:27:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2041.codfw.wmnet with reason: host reimage [16:28:02] (03PS3) 10Clément Goubert: wikikube: Default to containerd partition layout [puppet] - 10https://gerrit.wikimedia.org/r/1094383 (https://phabricator.wikimedia.org/T362408) [16:28:23] (03PS4) 10Andrew Bogott: Neutron: remove linuxbridge from mechanism_drivers [puppet] - 10https://gerrit.wikimedia.org/r/1092425 (https://phabricator.wikimedia.org/T326373) [16:28:31] (03PS2) 10Andrew Bogott: neutron.conf: remove [experimental] linuxbridge section [puppet] - 10https://gerrit.wikimedia.org/r/1094471 (https://phabricator.wikimedia.org/T326373) [16:33:44] (03PS1) 10Ebernhardson: cirrus: Dont attempt to dump s11 [puppet] - 10https://gerrit.wikimedia.org/r/1094484 [16:40:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2042.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:41:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094071 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [16:42:11] (03CR) 10Reedy: [C:03+1] build: Upgrade mediawiki/mediawiki-codesniffer from v43.0.0 to v45.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091320 (https://phabricator.wikimedia.org/T379955) (owner: 
10Jforrester) [16:42:45] !log herron@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2005.codfw.wmnet to plain [16:43:11] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [16:43:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [16:43:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2041.codfw.wmnet with OS bookworm [16:43:34] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2005.codfw.wmnet to plain [16:47:29] !log herron@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain [16:48:04] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain [16:52:20] (03CR) 10Giuseppe Lavagetto: aptrepo: add import for vopsbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093875 (owner: 10Giuseppe Lavagetto) [16:53:32] !log herron@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain [16:53:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2042.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:54:14] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain [16:55:38] (03PS13) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [16:56:27] (03CR) 10AOkoth: [C:03+1] 
wikidata-query-gui: bump images for gui and builder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [16:57:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:57:59] (03PS1) 10EoghanGaffney: vrts: Disable more VALIDITY RBL checks from Spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1094488 (https://phabricator.wikimedia.org/T380396) [16:58:01] (03CR) 10Hnowlan: "Thank you both for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:58:38] (03CR) 10CI reject: [V:04-1] vrts: Disable more VALIDITY RBL checks from Spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1094488 (https://phabricator.wikimedia.org/T380396) (owner: 10EoghanGaffney) [16:58:44] (03PS1) 10CDanis: k8s: temp. enforce maximum cluster size [puppet] - 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) [16:58:53] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) (owner: 10CDanis) [16:59:46] (03CR) 10AOkoth: [C:03+1] "Hahaha. I remember writing this mess. 
😄" [puppet] - 10https://gerrit.wikimedia.org/r/1093948 (https://phabricator.wikimedia.org/T380476) (owner: 10Jelto) [16:59:48] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1094488 (https://phabricator.wikimedia.org/T380396) (owner: 10EoghanGaffney) [17:00:12] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker2002.codfw.wmnet [17:00:14] !log herron@cumin1002 START - Cookbook sre.dns.netbox [17:01:38] (03PS1) 10CDanis: DO NOT SUBMIT: testing [puppet] - 10https://gerrit.wikimedia.org/r/1094490 [17:01:44] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094490 (owner: 10CDanis) [17:03:00] (03PS2) 10EoghanGaffney: vrts: Disable more VALIDITY RBL checks from Spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1094488 (https://phabricator.wikimedia.org/T380396) [17:04:15] (03PS2) 10CDanis: DO NOT SUBMIT: testing [puppet] - 10https://gerrit.wikimedia.org/r/1094490 [17:04:18] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094490 (owner: 10CDanis) [17:04:38] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182#10349153 (10phaultfinder) [17:05:34] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker2002.codfw.wmnet - herron@cumin1002" [17:05:38] (03PS1) 10Majavah: O:openstack: Drop codfw1dev db role [puppet] - 10https://gerrit.wikimedia.org/r/1094491 (https://phabricator.wikimedia.org/T369308) [17:05:39] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker2002.codfw.wmnet - herron@cumin1002" [17:05:39] 
!log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:05:39] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-worker2002.codfw.wmnet on all recursors [17:05:40] (03PS1) 10Majavah: O:openstack: cloudweb: Drop nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/1094492 (https://phabricator.wikimedia.org/T371378) [17:05:42] (03PS1) 10Majavah: nutcracker: Remove module (and related code) [puppet] - 10https://gerrit.wikimedia.org/r/1094493 [17:05:42] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker2002.codfw.wmnet on all recursors [17:06:08] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker2002.codfw.wmnet - herron@cumin1002" [17:06:12] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker2002.codfw.wmnet - herron@cumin1002" [17:06:56] (03PS1) 10Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [17:07:02] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4576/co" [puppet] - 10https://gerrit.wikimedia.org/r/1094493 (owner: 10Majavah) [17:07:49] (03PS2) 10CDanis: k8s: temp. 
enforce maximum cluster size [puppet] - 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) [17:07:49] (03PS3) 10CDanis: DO NOT SUBMIT: testing [puppet] - 10https://gerrit.wikimedia.org/r/1094490 [17:07:54] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) (owner: 10CDanis) [17:08:07] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2002.codfw.wmnet with OS bookworm [17:08:14] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987#10349179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-worker2002.codfw.wmnet with OS bookworm [17:09:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:11:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:13:04] (03PS2) 10Ebernhardson: cirrus: Dont attempt to dump s11 [puppet] - 10https://gerrit.wikimedia.org/r/1094484 (https://phabricator.wikimedia.org/T378260) [17:14:46] (03CR) 10CDanis: "This is a no-op in production: https://puppet-compiler.wmflabs.org/output/1094489/5002/" [puppet] - 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) (owner: 10CDanis) [17:20:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:20:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:22:56] !log herron@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on aux-k8s-worker2002.codfw.wmnet with reason: host reimage [17:22:58] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudsw1-d5-eqiad.mgmt,cloudsw1-e4-eqiad.mgmt with reason: replace optics on faulty WMCS link from D5 to E4 [17:23:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudsw1-d5-eqiad.mgmt,cloudsw1-e4-eqiad.mgmt with reason: replace optics on faulty WMCS link from D5 to E4 [17:23:23] 10ops-eqiad, 06SRE, 10Cloud-Services, 06DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10349255 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6b283bec-74b8-4f8c-9a46-f9f60c9c4026) set by c... [17:23:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:25:38] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2002.codfw.wmnet with reason: host reimage [17:27:57] (03CR) 10RLazarus: mediawiki: add mercurius features (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:28:21] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host es2042 [17:28:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2042 [17:30:39] (03CR) 10AOkoth: [C:03+1] vrts: Disable more VALIDITY RBL checks from Spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1094488 (https://phabricator.wikimedia.org/T380396) (owner: 10EoghanGaffney) [17:31:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and 
with Dell SCP reboot policy FORCED [17:32:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:33:54] 10ops-eqiad, 06SRE, 10Cloud-Services, 06DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10349290 (10VRiley-WMF) Replaced the transciever in cloudsw1-e4-eqiad et-0/0/54. Will test to see if that works. Trying to... [17:41:40] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2002.codfw.wmnet with OS bookworm [17:41:40] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker2002.codfw.wmnet [17:41:49] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987#10349315 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-worker2002.codfw.wmnet with OS bookworm completed: - aux-k8s-worker2002 (**PASS**)... [17:42:05] FIRING: [16x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:52:29] (03CR) 10Xcollazo: [C:03+1] cirrus: Dont attempt to dump s11 [puppet] - 10https://gerrit.wikimedia.org/r/1094484 (https://phabricator.wikimedia.org/T378260) (owner: 10Ebernhardson) [17:56:04] (03PS3) 10CDanis: k8s: temp. 
enforce maximum cluster size [puppet] - 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) [17:56:05] (03PS4) 10CDanis: DO NOT SUBMIT: testing [puppet] - 10https://gerrit.wikimedia.org/r/1094490 [17:56:51] (03CR) 10CI reject: [V:04-1] DO NOT SUBMIT: testing [puppet] - 10https://gerrit.wikimedia.org/r/1094490 (owner: 10CDanis) [17:58:47] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:00:01] 10ops-eqiad, 06SRE, 10Cloud-Services, 06DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10349375 (10cmooney) Thanks @VRiley-WMF! Seems ok so far but we can make a call Monday based on if we see errors on the li... [18:02:01] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker2003.codfw.wmnet [18:02:15] (03PS4) 10CDanis: k8s: temp. enforce maximum cluster size [puppet] - 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) [18:02:15] (03PS5) 10CDanis: DO NOT SUBMIT: testing [puppet] - 10https://gerrit.wikimedia.org/r/1094490 [18:02:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2042 to codfw - jhancock@cumin2002" [18:02:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2042 to codfw - jhancock@cumin2002" [18:02:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:03:04] (03CR) 10CI reject: [V:04-1] DO NOT SUBMIT: testing [puppet] - 10https://gerrit.wikimedia.org/r/1094490 (owner: 10CDanis) [18:03:17] !log herron@cumin1002 START - Cookbook sre.dns.netbox [18:05:57] (03CR) 10CDanis: "And now, such a hieradata patch would be rejected by CI: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094490/5#message-da5f97d047" [puppet] 
- 10https://gerrit.wikimedia.org/r/1094489 (https://phabricator.wikimedia.org/T375845) (owner: 10CDanis) [18:09:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2042.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:10:10] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker2003.codfw.wmnet - herron@cumin1002" [18:10:15] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker2003.codfw.wmnet - herron@cumin1002" [18:10:15] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:10:15] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-worker2003.codfw.wmnet on all recursors [18:10:18] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker2003.codfw.wmnet on all recursors [18:10:47] (03PS2) 10D3r1ck01: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) [18:10:50] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker2003.codfw.wmnet - herron@cumin1002" [18:10:54] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker2003.codfw.wmnet - herron@cumin1002" [18:11:19] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2003.codfw.wmnet with OS bookworm [18:11:33] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987#10349405 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-worker2003.codfw.wmnet with OS bookworm [18:13:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2042.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:17:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2042.codfw.wmnet with OS bookworm [18:17:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349428 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2042.codfw.wmnet with OS bookworm [18:26:18] (03PS3) 10Ebernhardson: cirrus: Dont attempt to dump s11 [puppet] - 10https://gerrit.wikimedia.org/r/1094484 (https://phabricator.wikimedia.org/T378260) [18:26:21] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094484 (https://phabricator.wikimedia.org/T378260) (owner: 10Ebernhardson) [18:27:46] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2003.codfw.wmnet with reason: host reimage [18:31:04] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2003.codfw.wmnet with reason: host reimage [18:32:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2042.codfw.wmnet with reason: host reimage [18:33:06] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1094492 (https://phabricator.wikimedia.org/T371378) (owner: 10Majavah) [18:33:47] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093875 (owner: 10Giuseppe Lavagetto) [18:35:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2042.codfw.wmnet with reason: host reimage [18:35:49] (03PS1) 
10Dbrant: New stream config for Android Rabbit Holes feature. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094511 (https://phabricator.wikimedia.org/T380107) [18:44:10] FIRING: [18x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:45:26] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2003.codfw.wmnet with OS bookworm [18:45:26] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker2003.codfw.wmnet [18:45:49] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987#10349519 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-worker2003.codfw.wmnet with OS bookworm completed: - aux-k8s-worker2003 (**PASS**)... 
[18:50:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2046.codfw.wmnet with OS bookworm [18:50:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2045.codfw.wmnet with OS bookworm [18:50:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2044.codfw.wmnet with OS bookworm [18:50:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2043.codfw.wmnet with OS bookworm [18:50:43] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349524 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2046.codfw.wmnet with OS bookworm [18:50:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349525 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2045.codfw.wmnet with OS bookworm [18:50:48] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349527 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2043.codfw.wmnet with OS bookworm [18:50:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349526 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2044.codfw.wmnet with OS bookworm [18:52:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:53:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:53:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2042.codfw.wmnet with OS bookworm [18:53:16] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349532 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2042.codfw.wmnet with OS bookworm complete... [18:58:55] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker2004.codfw.wmnet [18:58:56] !log herron@cumin1002 START - Cookbook sre.dns.netbox [19:05:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2044.codfw.wmnet with reason: host reimage [19:05:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2043.codfw.wmnet with reason: host reimage [19:05:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2046.codfw.wmnet with reason: host reimage [19:05:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2045.codfw.wmnet with reason: host reimage [19:09:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2044.codfw.wmnet with reason: host reimage [19:09:50] (03CR) 10EoghanGaffney: [C:03+2] vrts: Disable more VALIDITY RBL checks from Spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1094488 (https://phabricator.wikimedia.org/T380396) (owner: 10EoghanGaffney) [19:10:14] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker2004.codfw.wmnet - herron@cumin1002" [19:10:18] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker2004.codfw.wmnet - herron@cumin1002" [19:10:18] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:10:18] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-worker2004.codfw.wmnet on all recursors [19:10:21] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker2004.codfw.wmnet on all recursors [19:10:47] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker2004.codfw.wmnet - herron@cumin1002" [19:10:51] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker2004.codfw.wmnet - herron@cumin1002" [19:13:06] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2004.codfw.wmnet with OS bookworm [19:13:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2043.codfw.wmnet with reason: host reimage [19:13:13] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987#10349605 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-worker2004.codfw.wmnet with OS bookworm [19:16:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2046.codfw.wmnet with reason: host reimage [19:19:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2045.codfw.wmnet with reason: host reimage [19:26:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:27:10] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:27:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2044.codfw.wmnet with OS bookworm [19:27:19] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2044.codfw.wmnet with OS bookworm complete... [19:27:51] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2004.codfw.wmnet with reason: host reimage [19:29:08] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:30:49] (03CR) 10Ahmon Dancy: role::beta::deploymentserver: Populate docker group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [19:31:28] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2004.codfw.wmnet with reason: host reimage [19:32:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:32:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2043.codfw.wmnet with OS bookworm [19:32:31] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349668 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2043.codfw.wmnet with OS bookworm complete... 
[19:32:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:35:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:35:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2046.codfw.wmnet with OS bookworm [19:35:27] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2046.codfw.wmnet with OS bookworm complete... [19:36:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:36:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:36:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2045.codfw.wmnet with OS bookworm [19:36:53] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349680 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2045.codfw.wmnet with OS bookworm complete... 
[19:37:04] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349681 (10Jhancock.wm) [19:42:14] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10349691 (10Jhancock.wm) 05Open→03Resolved @ABran-WMF finished [19:47:41] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2004.codfw.wmnet with OS bookworm [19:47:42] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker2004.codfw.wmnet [19:47:51] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987#10349719 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-worker2004.codfw.wmnet with OS bookworm completed: - aux-k8s-worker2004 (**PASS**)... 
[20:00:03] PROBLEM - Host parse2017 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:50] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker2005.codfw.wmnet [20:07:51] !log herron@cumin1002 START - Cookbook sre.dns.netbox [20:17:17] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker2005.codfw.wmnet - herron@cumin1002" [20:17:21] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker2005.codfw.wmnet - herron@cumin1002" [20:17:21] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:17:21] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-worker2005.codfw.wmnet on all recursors [20:17:25] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker2005.codfw.wmnet on all recursors [20:17:54] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker2005.codfw.wmnet - herron@cumin1002" [20:17:59] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker2005.codfw.wmnet - herron@cumin1002" [20:20:09] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2005.codfw.wmnet with OS bookworm [20:20:19] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987#10349819 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-worker2005.codfw.wmnet with OS bookworm [20:26:55] FIRING: MaxConntrack: Max conntrack at 94.6% on krb1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [20:31:55] RESOLVED: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [20:37:55] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2005.codfw.wmnet with reason: host reimage [20:40:53] (03CR) 10Scott French: "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [20:41:29] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2005.codfw.wmnet with reason: host reimage [20:48:51] (03CR) 10Scott French: mediawiki: add mercurius features (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [20:50:56] (03PS1) 10Bking: service.yaml: add dummy config to quash PCC failures [puppet] - 10https://gerrit.wikimedia.org/r/1094530 (https://phabricator.wikimedia.org/T379329) [20:51:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094530 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [20:53:39] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094484 (https://phabricator.wikimedia.org/T378260) (owner: 10Ebernhardson) [20:55:16] (03CR) 10Ebernhardson: [C:03+1] service.yaml: add dummy config to quash PCC failures [puppet] - 10https://gerrit.wikimedia.org/r/1094530 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [20:56:22] (03PS2) 10Bking: service.yaml: add dummy config to quash PCC failures [puppet] - 
10https://gerrit.wikimedia.org/r/1094530 (https://phabricator.wikimedia.org/T379329) [20:57:22] (03CR) 10Bking: [C:03+2] service.yaml: add dummy config to quash PCC failures [puppet] - 10https://gerrit.wikimedia.org/r/1094530 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [20:59:07] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2005.codfw.wmnet with OS bookworm [20:59:07] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker2005.codfw.wmnet [20:59:20] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987#10349902 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-worker2005.codfw.wmnet with OS bookworm completed: - aux-k8s-worker2005 (**PASS**)... [20:59:42] (03CR) 10Gergő Tisza: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [21:00:26] (03CR) 10Bking: [C:03+2] cirrus: Dont attempt to dump s11 [puppet] - 10https://gerrit.wikimedia.org/r/1094484 (https://phabricator.wikimedia.org/T378260) (owner: 10Ebernhardson) [21:09:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-internal-main.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:10:34] ^^ Pretty sure that is related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094530 [21:10:37] checking now [21:14:02] confirmed, pushing a puppet patch to fix shortly [21:14:36] (03PS15) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [21:14:41] FIRING: [8x] 
ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-internal-main.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:14:54] (03CR) 10Bking: [C:03+2] wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) (owner: 10Ryan Kemper) [21:14:57] (03CR) 10Bking: [V:03+2 C:03+2] wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) (owner: 10Ryan Kemper) [21:20:10] (03PS3) 10D3r1ck01: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) [21:20:45] (03PS4) 10D3r1ck01: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) [21:21:06] (03CR) 10D3r1ck01: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [21:25:15] !log bking@cumin2002 conftool action : set/pooled=yes:weight=1; selector: cluster=wdqs-main,service=wdqs-internal-main [21:25:27] !log bking@cumin2002 conftool action : set/pooled=yes:weight=1; selector: cluster=wdqs-scholarly,service=wdqs-internal-scholarly [21:25:48] OK...that should hopefully clear those confd alerts. 
Still watching [21:31:22] still failing, running `/usr/local/bin/dump-conftool-pools --output` from config-master2001 to see what's up [21:33:31] !log bking@cumin2002 conftool action : set/weight=1; selector: name=wdqs2018.codfw.wmnet [21:33:39] !log bking@cumin2002 conftool action : set/weight=1; selector: name=wdqs2026.codfw.wmnet [21:37:03] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=wdqs2018.codfw.wmnet [21:37:11] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=wdqs2026.codfw.wmnet [21:38:59] (03PS1) 10Ahmon Dancy: scap::spiderpig: New class for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 [21:39:37] (03CR) 10CI reject: [V:04-1] scap::spiderpig: New class for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (owner: 10Ahmon Dancy) [21:40:38] (03PS2) 10Ahmon Dancy: scap::spiderpig: New class for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 [21:51:35] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=wdqs-internal-scholarly,name=eqiad [21:54:17] (03PS3) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 [21:56:27] (03PS7) 10Ahmon Dancy: role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [21:56:27] (03PS4) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 [21:56:35] (03CR) 10CI reject: [V:04-1] profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (owner: 10Ahmon Dancy) [21:59:32] (03CR) 10CI reject: [V:04-1] profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (owner: 10Ahmon Dancy) [22:13:09] (03PS1) 10Bking: wdqs-internal-main and wdqs-internal-scholarly: Fix conftool data [puppet] - 
10https://gerrit.wikimedia.org/r/1094536 (https://phabricator.wikimedia.org/T379329) [22:13:15] (03PS7) 10Aleksandar Mastilovic: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [22:13:29] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094536 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [22:13:42] (03PS5) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 [22:15:23] (03CR) 10Ssingh: [C:03+1] wdqs-internal-main and wdqs-internal-scholarly: Fix conftool data [puppet] - 10https://gerrit.wikimedia.org/r/1094536 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [22:18:28] (03CR) 10Bking: [V:03+2 C:03+2] wdqs-internal-main and wdqs-internal-scholarly: Fix conftool data [puppet] - 10https://gerrit.wikimedia.org/r/1094536 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [22:20:41] (03CR) 10Ahmon Dancy: "This is running on deployment-deploy04 at the moment." 
[puppet] - 10https://gerrit.wikimedia.org/r/1094531 (owner: 10Ahmon Dancy) [22:25:48] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:27:27] (03PS1) 10Bking: wdqs-internal-[main,scholarly]: deactivate LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) [22:29:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:05] (03PS2) 10Bking: wdqs-internal-[main,scholarly]: deactivate LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) [22:43:04] (03PS1) 10Bking: Revert "wdqs-internal-main and wdqs-internal-scholarly: Fix conftool data" [puppet] - 10https://gerrit.wikimedia.org/r/1094547 [22:44:47] (03CR) 10Bking: [C:03+2] Revert "wdqs-internal-main and wdqs-internal-scholarly: Fix conftool data" [puppet] - 10https://gerrit.wikimedia.org/r/1094547 (owner: 10Bking) [22:45:41] (03PS1) 10Bking: Revert "wdqs: new pybal pools for internal graph split" [puppet] - 10https://gerrit.wikimedia.org/r/1094548 [22:46:55] (03CR) 10Bking: [C:03+2] Revert "wdqs: new pybal pools for internal graph split" [puppet] - 10https://gerrit.wikimedia.org/r/1094548 (owner: 10Bking) [22:47:05] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:40] (03PS1) 10Bking: Revert "service.yaml: add dummy config to quash 
PCC failures" [puppet] - 10https://gerrit.wikimedia.org/r/1094550 [22:57:06] (03CR) 10Bking: [C:03+2] Revert "service.yaml: add dummy config to quash PCC failures" [puppet] - 10https://gerrit.wikimedia.org/r/1094550 (owner: 10Bking) [23:04:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-internal-main.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:06:09] (03PS1) 10Bking: Revert "wdqs: create wdqs-internal-[main,scholarly] roles" [puppet] - 10https://gerrit.wikimedia.org/r/1094554 [23:06:20] (03CR) 10CI reject: [V:04-1] Revert "wdqs: create wdqs-internal-[main,scholarly] roles" [puppet] - 10https://gerrit.wikimedia.org/r/1094554 (owner: 10Bking) [23:09:41] RESOLVED: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-internal-main.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:14:55] (03PS1) 10RLazarus: admin: Upgrade jasmine from ops-limited to ops [puppet] - 10https://gerrit.wikimedia.org/r/1094564 [23:16:20] (03CR) 10Jasmine: [C:03+1] admin: Upgrade jasmine from ops-limited to ops [puppet] - 10https://gerrit.wikimedia.org/r/1094564 (owner: 10RLazarus) [23:18:42] (03PS3) 10Bking: wdqs-internal-[main,scholarly]: deactivate LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) [23:18:54] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4577/console" [puppet] - 10https://gerrit.wikimedia.org/r/1094564 (owner: 10RLazarus) [23:19:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) (owner: 
10Bking) [23:23:16] (03PS4) 10Bking: wdqs-internal-[main,scholarly]: deactivate LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) [23:23:24] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [23:23:33] (03CR) 10CDanis: [C:03+1] admin: Upgrade jasmine from ops-limited to ops [puppet] - 10https://gerrit.wikimedia.org/r/1094564 (owner: 10RLazarus) [23:24:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:25:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:28:46] (03CR) 10RLazarus: [V:03+1 C:03+2] admin: Upgrade jasmine from ops-limited to ops [puppet] - 10https://gerrit.wikimedia.org/r/1094564 (owner: 10RLazarus) [23:30:13] (03PS5) 10Bking: wdqs-internal-[main,scholarly]: deactivate LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) [23:30:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [23:31:34] (03PS6) 10Bking: wdqs-internal-[main,scholarly]: deactivate LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) [23:32:09] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [23:32:59] (03PS7) 10Bking: wdqs-internal-[main,scholarly]: deactivate LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1094539 
(https://phabricator.wikimedia.org/T379329) [23:34:12] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [23:36:27] (03PS1) 10Bking: Revert "wdqs-internal: add envoy config for graph split" [puppet] - 10https://gerrit.wikimedia.org/r/1094570 [23:37:23] (03CR) 10Bking: [C:03+2] Revert "wdqs-internal: add envoy config for graph split" [puppet] - 10https://gerrit.wikimedia.org/r/1094570 (owner: 10Bking) [23:40:41] (03CR) 10Scott French: [C:03+1] "Looks like this will work as a simpler alternative to wrangling conflicts on a revert of 5c2fe3451c7d537e1cf78bcccd3fec3198642a07." [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [23:41:34] (03CR) 10Bking: [C:03+2] wdqs-internal-[main,scholarly]: deactivate LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1094539 (https://phabricator.wikimedia.org/T379329) (owner: 10Bking) [23:43:49] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094484 (https://phabricator.wikimedia.org/T378260) (owner: 10Ebernhardson) [23:50:11] 06SRE-OnFire, 10Incident Tooling: corto: only operate on applicable phabricator issues - https://phabricator.wikimedia.org/T380293#10350272 (10Eevans) p:05Triage→03Medium a:03Eevans