[00:09:37] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:10:37] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:11:14] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1140550
[00:11:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1140550 (owner: TrainBranchBot)
[00:32:59] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1140550 (owner: TrainBranchBot)
[01:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:28:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:39:38] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10784843 (phaultfinder)
[02:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:01] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 2054534256 and 110 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:12:01] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 157064 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:13:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:33:25] FIRING: SystemdUnitFailed: wmf_auto_restart_uwsgi-netbox.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:38:43] FIRING: [149x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:46:09] mutante: yes
[05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:21] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10784916 (Stevemunene) Ack, thanks @Jclark-ctr. Proceeding with the rest of the steps
[05:22:42] (PS1) Marostegui: mariadb: Add pc2018 [puppet] - https://gerrit.wikimedia.org/r/1140557 (https://phabricator.wikimedia.org/T393110)
[05:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:23:35] (PS2) Marostegui: mariadb: Add pc2018 [puppet] - https://gerrit.wikimedia.org/r/1140557 (https://phabricator.wikimedia.org/T393110)
[05:24:08] (CR) Marostegui: [C:+2] mariadb: Add pc2018 [puppet] - https://gerrit.wikimedia.org/r/1140557 (https://phabricator.wikimedia.org/T393110) (owner: Marostegui)
[05:26:15] (PS1) Marostegui: installserver: Format pc2018 [puppet] - https://gerrit.wikimedia.org/r/1140558 (https://phabricator.wikimedia.org/T393110)
[05:28:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:28:39] (CR) Marostegui: [C:+2] installserver: Format pc2018 [puppet] - https://gerrit.wikimedia.org/r/1140558 (https://phabricator.wikimedia.org/T393110) (owner: Marostegui)
[05:29:02] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10784927 (Marostegui) a:Marostegui→None Puppet patches are done
[05:29:18] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10784929 (Marostegui)
[05:32:09] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10784933 (Marostegui) Open→Resolved All good from my side @Jhancock.wm: ` root@db2176:~# sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli communication: 0 OK | controller: 0 OK |...
[05:34:20] (PS1) Stevemunene: Readd an-worker1166-68 to cluster [puppet] - https://gerrit.wikimedia.org/r/1140559 (https://phabricator.wikimedia.org/T390170)
[05:34:36] (PS1) Marostegui: mariadb: Add es204[78] [puppet] - https://gerrit.wikimedia.org/r/1140560 (https://phabricator.wikimedia.org/T393106)
[05:35:45] (PS2) Marostegui: mariadb: Add es204[78] [puppet] - https://gerrit.wikimedia.org/r/1140560 (https://phabricator.wikimedia.org/T393106)
[05:44:42] (CR) Marostegui: [C:+2] mariadb: Add es204[78] [puppet] - https://gerrit.wikimedia.org/r/1140560 (https://phabricator.wikimedia.org/T393106) (owner: Marostegui)
[05:46:43] (PS1) Marostegui: installserver: Format es2047|es2048 [puppet] - https://gerrit.wikimedia.org/r/1140562 (https://phabricator.wikimedia.org/T393106)
[05:48:53] (CR) Marostegui: [C:+2] installserver: Format es2047|es2048 [puppet] - https://gerrit.wikimedia.org/r/1140562 (https://phabricator.wikimedia.org/T393106) (owner: Marostegui)
[05:49:12] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10784944 (Marostegui) a:Marostegui→None Patches are done
[05:49:25] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10784946 (Marostegui)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250502T0600)
[06:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:09:14] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1166.eqiad.wmnet
[06:13:51] stevemunene@cumin1002 init-hadoop-workers (PID 1667940) is awaiting input
[06:14:21] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1166.eqiad.wmnet
[06:18:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:18:46] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1167.eqiad.wmnet
[06:21:15] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1167.eqiad.wmnet
[06:27:51] (PS1) Marostegui: mariadb: Add es104[78] [puppet] - https://gerrit.wikimedia.org/r/1140564 (https://phabricator.wikimedia.org/T393107)
[06:28:00] (CR) Slyngshede: [C:+2] data.yaml: Offboarding marktraceur [puppet] - https://gerrit.wikimedia.org/r/1140474 (owner: Slyngshede)
[06:28:36] (CR) Marostegui: [C:+2] mariadb: Add es104[78] [puppet] - https://gerrit.wikimedia.org/r/1140564 (https://phabricator.wikimedia.org/T393107) (owner: Marostegui)
[06:30:23] !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging MarkTraceur out of all services on: 2404 hosts
[06:31:56] (PS1) Marostegui: installserver: Format es1047|es1048 [puppet] - https://gerrit.wikimedia.org/r/1140565 (https://phabricator.wikimedia.org/T393107)
[06:32:22] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10784975 (Marostegui) a:Marostegui→None
[06:32:42] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10784979 (Marostegui) Puppet patches are done
[06:34:07] (CR) Marostegui: [C:+2] installserver: Format es1047|es1048 [puppet] - https://gerrit.wikimedia.org/r/1140565 (https://phabricator.wikimedia.org/T393107) (owner: Marostegui)
[06:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:42:43] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1167.eqiad.wmnet
[06:46:03] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1167.eqiad.wmnet
[06:54:48] (PS1) Arturo Borrero Gonzalez: Revert "admin: temporarily remove ssh key for aborrero" [puppet] - https://gerrit.wikimedia.org/r/1140567
[06:56:41] (CR) David Caro: [C:+1] "verified face to face" [puppet] - https://gerrit.wikimedia.org/r/1140567 (owner: Arturo Borrero Gonzalez)
[06:57:30] (CR) Majavah: [C:+2] Revert "admin: temporarily remove ssh key for aborrero" [puppet] - https://gerrit.wikimedia.org/r/1140567 (owner: Arturo Borrero Gonzalez)
[06:59:30] (CR) Joal: "I'm not good at Prometheus metrics so the I'm not validating the code. I +1 the functional aspect though :)" [alerts] - https://gerrit.wikimedia.org/r/1136383 (https://phabricator.wikimedia.org/T391810) (owner: Fabfur)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250502T0700)
[07:02:28] (PS1) David Caro: Revert "admin: temporarily remove dcaro access" [puppet] - https://gerrit.wikimedia.org/r/1140569
[07:05:37] (CR) Arturo Borrero Gonzalez: [C:+1] "in-person verification." [puppet] - https://gerrit.wikimedia.org/r/1140569 (owner: David Caro)
[07:05:52] (CR) Arturo Borrero Gonzalez: [C:+2] Revert "admin: temporarily remove dcaro access" [puppet] - https://gerrit.wikimedia.org/r/1140569 (owner: David Caro)
[07:10:46] (PS1) Majavah: P:puppetserver::wmcs: Fix unattended-upgrades openjdk exclude [puppet] - https://gerrit.wikimedia.org/r/1140572
[07:12:02] (PS2) Majavah: P:puppetserver::wmcs: Fix unattended-upgrades openjdk exclude [puppet] - https://gerrit.wikimedia.org/r/1140572
[07:13:29] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1140572 (owner: Majavah)
[07:13:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:17:05] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5431/co" [puppet] - https://gerrit.wikimedia.org/r/1140572 (owner: Majavah)
[07:17:39] (CR) Majavah: [V:+1 C:+2] P:puppetserver::wmcs: Fix unattended-upgrades openjdk exclude [puppet] - https://gerrit.wikimedia.org/r/1140572 (owner: Majavah)
[07:27:06] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10785049 (tappof) Open→Resolved a:tappof
[07:30:25] SRE, SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140 (cmassaro) NEW
[07:33:25] FIRING: SystemdUnitFailed: wmf_auto_restart_uwsgi-netbox.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:33:55] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T393066#10785083 (tappof)
[07:34:41] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T393066#10785084 (tappof)
[07:38:43] FIRING: [149x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:39:25] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T393066#10785107 (tappof)
[07:41:32] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T393066#10785119 (tappof) Open→In progress @Ospingou, could you please sign off on the access request? Thank you!
[07:47:43] SRE, ChangeProp, cloud-services-team, collaboration-services, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#10785130 (hashar) Redis is now available under the AGPLv3. That was announced by their CEO at https://red...
[07:50:57] (CR) Fabfur: [C:+1] trafficserver: explicitly specify user/group for systemd unit [puppet] - https://gerrit.wikimedia.org/r/1091330 (owner: Ssingh)
[07:51:43] SRE, SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10785150 (tappof)
[08:00:26] (CR) Filippo Giunchedi: [C:+2] icinga: add new frack hosts for basic monitoring [puppet] - https://gerrit.wikimedia.org/r/1140541 (https://phabricator.wikimedia.org/T386259) (owner: Dwisehaupt)
[08:00:39] (CR) Filippo Giunchedi: [C:+2] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1139930 (https://phabricator.wikimedia.org/T392961) (owner: Dwisehaupt)
[08:04:27] SRE, Infrastructure-Foundations, Puppet-Core: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146 (MoritzMuehlenhoff) NEW
[08:04:48] ops-eqiad, SRE, Data-Persistence, DC-Ops, Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10785201 (MoritzMuehlenhoff) >>! In T391854#10783203, @MatthewVernon wrote: > - @MoritzMuehlenhoff has packaged `storcli` for deplo...
[08:05:06] (CR) Filippo Giunchedi: [V:+1] "I think" [puppet] - https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: Andrea Denisse)
[08:05:42] SRE, ChangeProp, cloud-services-team, collaboration-services, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#10785204 (akosiaris) >>! In T360596#10785130, @hashar wrote: > Redis is now available under the AGPLv3. T...
[08:07:12] (CR) Ayounsi: sre.hosts.rename: wipe DNS cache after rename (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: Bking)
[08:07:41] (PS3) Fabfur: haproxykafka: service unit brought by deb package [puppet] - https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016)
[08:08:29] PROBLEM - Ensure traffic_server is running for instance backend on cp1110 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:09:29] RECOVERY - Ensure traffic_server is running for instance backend on cp1110 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:09:38] !log jmm@cumin1002 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org
[08:09:49] ^^ don't know why it triggered, it looked ok when I checked on cp1110
[08:09:59] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: Fabfur)
[08:10:03] (CR) Fabfur: haproxykafka: service unit brought by deb package (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: Fabfur)
[08:12:37] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:13:04] !log push pfw policies - T393098
[08:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:30] SRE, SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10785215 (tappof) Open→In progress
[08:13:58] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5433/co" [puppet] - https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: Andrea Denisse)
[08:16:08] !log jmm@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org
[08:17:28] (CR) Klausman: [C:+1] admin_ng: enable Knative's secure-pod-defaults for ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1140140 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[08:19:48] (CR) Alexandros Kosiaris: [C:+1] "Yup, a change like that to confd is unlikely. This looks pretty good to me as a solution." [puppet] - https://gerrit.wikimedia.org/r/1139893 (owner: JHathaway)
[08:20:21] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10785229 (Stevemunene) Added the logical drives on all the hosts ` stevemunene@an-worker1168:~$...
[08:21:28] (CR) Filippo Giunchedi: [V:+1] "* I'm not sure removing quickdatacopy from the catalog actually disables sync'ing, how did you test this ?" [puppet] - https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: Andrea Denisse)
[08:23:39] (CR) Fabfur: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: Fabfur)
[08:28:24] (PS4) Fabfur: haproxykafka: service unit brought by deb package [puppet] - https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016)
[08:29:08] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: Fabfur)
[08:29:17] (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1140080 (owner: Slyngshede)
[08:29:30] !log update codfw pfw NAT - T392843
[08:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:32] (CR) Slyngshede: [C:+2] Permission management: Add pagination to log [software/bitu] - https://gerrit.wikimedia.org/r/1140080 (owner: Slyngshede)
[08:30:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet
[08:31:17] SRE, SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10785246 (tappof) Hi @cmassaro, It looks like the "Requested group membership" field is missing from your form. Could you please let us know which group(s) you need to be added to? Thanks!
[08:34:09] (Merged) jenkins-bot: Permission management: Add pagination to log [software/bitu] - https://gerrit.wikimedia.org/r/1140080 (owner: Slyngshede)
[08:34:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet
[08:49:27] (CR) Jcrespo: "@Moritz The main blocker -and should be attended soon (but not impacting this)- is that transfer.py should be updated to allow managing nf" [puppet] - https://gerrit.wikimedia.org/r/1133359 (https://phabricator.wikimedia.org/T389825) (owner: Muehlenhoff)
[08:49:40] (PS11) Máté Szabó: Enable electionclerk user group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: Dreamrimmer)
[08:52:10] (CR) Jcrespo: "CC @Mvernon & @ladsgroup as I think they use transfer.py outside of my work as backups (and would love a review when done for speedup)." [puppet] - https://gerrit.wikimedia.org/r/1133359 (https://phabricator.wikimedia.org/T389825) (owner: Muehlenhoff)
[08:53:48] (CR) Jcrespo: [C:+1] microsites/backup: remove rt-static backup fileset [puppet] - https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[08:56:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:58:36] (CR) Jcrespo: [C:+1] "This is ok-ish, I think we may need to dedicate a new backup host for gerrit hosts + gitlab, as I think we are close to maximize space on " [puppet] - https://gerrit.wikimedia.org/r/1140507 (https://phabricator.wikimedia.org/T393034) (owner: Dzahn)
[08:58:45] (PS1) MVernon: install_server: UEFI setup for thanos-be[1-2]00[6-9] [puppet] - https://gerrit.wikimedia.org/r/1140644 (https://phabricator.wikimedia.org/T392908)
[08:59:55] (PS2) MVernon: install_server: UEFI setup for thanos-be[1-2]00[6-9] [puppet] - https://gerrit.wikimedia.org/r/1140644 (https://phabricator.wikimedia.org/T392908)
[09:03:19] (CR) Jcrespo: [C:+1] "Same comment than on the other host: if we are tripling or doubling the storage, we may need additional work. I would like to be around wh" [puppet] - https://gerrit.wikimedia.org/r/1140506 (https://phabricator.wikimedia.org/T393034) (owner: Dzahn)
[09:06:48] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10785418 (MatthewVernon) a:MatthewVernon→None I've checked, and ms-be109* is already setup for new-style storage & UEFI booting, so no puppet ch...
[09:08:54] (CR) Hashar: gerrit: add a second replica, start replicating to gerrit2003 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[09:14:24] (CR) Jcrespo: [C:+1] "I checked only the bash regex, not the logic." [puppet] - https://gerrit.wikimedia.org/r/1140644 (https://phabricator.wikimedia.org/T392908) (owner: MVernon)
[09:14:48] (PS1) Hashar: admin: hashar: sync up shell aliases [puppet] - https://gerrit.wikimedia.org/r/1140648
[09:21:01] (CR) MVernon: [C:+2] install_server: UEFI setup for thanos-be[1-2]00[6-9] [puppet] - https://gerrit.wikimedia.org/r/1140644 (https://phabricator.wikimedia.org/T392908) (owner: MVernon)
[09:22:16] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10785487 (MatthewVernon) a:MatthewVernon→None
[09:22:45] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10785488 (MatthewVernon) a:MatthewVernon→None
[09:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:26:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[09:31:24] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1167.eqiad.wmnet
[09:32:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[09:33:30] (PS1) Krinkle: tests: Fix dynamic property warning in DNSSRVRecordTest and DBRecordCacheTest [mediawiki-config] - https://gerrit.wikimedia.org/r/1140652
[09:33:30] (PS1) Krinkle: multiversion: Remove getMWConfigForCacheing() as identical to getConfigGlobals() [mediawiki-config] - https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821)
[09:34:17] (CR) CI reject: [V:-1] multiversion: Remove getMWConfigForCacheing() as identical to getConfigGlobals() [mediawiki-config] - https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[09:35:09] (PS2) Krinkle: multiversion: Remove getMWConfigForCacheing() as identical to getConfigGlobals() [mediawiki-config] - https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821)
[09:36:32] stevemunene@cumin1002 init-hadoop-workers (PID 1697027) is awaiting input
[09:37:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet
[09:41:24] stevemunene@cumin1002 init-hadoop-workers (PID 1697027) is awaiting input
[09:41:24] jouncebot: nowandnext
[09:41:24] For the next 21 hour(s) and 18 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250502T0700)
[09:41:24] In 1 hour(s) and 18 minute(s): GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250502T1100)
[09:42:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet
[09:51:13] (PS1) Krinkle: tests: Move buildLogoHTML.php to tests/ alongside buildConfigCache.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1140658
[09:53:48] (PS1) Muehlenhoff: Initial Puppet agent apt config for Puppet 7 in Trixie [puppet] - https://gerrit.wikimedia.org/r/1140659 (https://phabricator.wikimedia.org/T392790)
[09:54:52] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1167.eqiad.wmnet
[09:55:01] (PS3) Krinkle: multiversion: Remove getMWConfigForCacheing() as identical to getConfigGlobals() [mediawiki-config] - https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821)
[09:55:01] (PS2) Krinkle: tests: Move buildLogoHTML.php to tests/ alongside buildConfigCache.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1140658
[09:55:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2001.codfw.wmnet
[09:56:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:00:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2001.codfw.wmnet
[10:04:17] (CR) Kamila Součková: [C:+1] mw:periodic_job:kubernetes: quote job description [puppet] - https://gerrit.wikimedia.org/r/1140548 (owner: Scott French)
[10:06:37] !log imported ruby-concurrent 1.1.6+dfsg-5~wmf13u1 to component/puppet7 for trixie-wikimedia T392790
[10:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:40] T392790: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790
[10:09:25] (PS1) Novem Linguae: core-Permissions: refactor enwiki wgRemoveGroups [mediawiki-config] - https://gerrit.wikimedia.org/r/1140661
[10:13:10] (PS1) Stevemunene: hdfs: set an-worker116[6-8] to in setup role [puppet] - https://gerrit.wikimedia.org/r/1140663 (https://phabricator.wikimedia.org/T390170)
[10:16:47] (CR) Btullis: [C:+1] hdfs: set an-worker116[6-8] to in setup role [puppet] - https://gerrit.wikimedia.org/r/1140663 (https://phabricator.wikimedia.org/T390170) (owner: Stevemunene)
[10:17:17] (CR) Stevemunene: [C:+2] hdfs: set an-worker116[6-8] to in setup role [puppet] - https://gerrit.wikimedia.org/r/1140663 (https://phabricator.wikimedia.org/T390170) (owner: Stevemunene)
[10:22:00] (PS2) Filippo Giunchedi: kubernetes: remove master usage of prometheus_all_nodes, Prometheus has access by default [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170)
[10:22:26] (CR) Filippo Giunchedi: "Yes that is correct! Thank you for the feedback -- I have updated the commit message." [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[10:24:07] (CR) CI reject: [V:-1] kubernetes: remove master usage of prometheus_all_nodes, Prometheus has access by default [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[10:31:06] (PS1) Btullis: Update dumpsgen SSH settings on clouddumps servers. [puppet] - https://gerrit.wikimedia.org/r/1140664 (https://phabricator.wikimedia.org/T389784)
[10:31:56] (PS2) Btullis: Update dumpsgen SSH settings on clouddumps servers. [puppet] - https://gerrit.wikimedia.org/r/1140664 (https://phabricator.wikimedia.org/T389784)
[10:33:13] (CR) Hnowlan: [C:+1] mw:periodic_job:kubernetes: quote job description [puppet] - https://gerrit.wikimedia.org/r/1140548 (owner: Scott French)
[10:34:57] (PS3) Btullis: Update dumpsgen SSH settings on clouddumps servers. [puppet] - https://gerrit.wikimedia.org/r/1140664 (https://phabricator.wikimedia.org/T389784)
[10:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:36:37] (PS4) Btullis: Update dumpsgen SSH settings on clouddumps servers. [puppet] - https://gerrit.wikimedia.org/r/1140664 (https://phabricator.wikimedia.org/T389784)
[10:37:51] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5436/co" [puppet] - https://gerrit.wikimedia.org/r/1140664 (https://phabricator.wikimedia.org/T389784) (owner: Btullis)
[10:40:57] (PS3) Filippo Giunchedi: kubernetes: remove usage of prometheus_all_nodes, Prometheus has access by default [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170)
[10:41:10] (PS5) Btullis: Update dumpsgen SSH settings on clouddumps servers. [puppet] - https://gerrit.wikimedia.org/r/1140664 (https://phabricator.wikimedia.org/T389784)
[10:42:25] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5437/co" [puppet] - https://gerrit.wikimedia.org/r/1140664 (https://phabricator.wikimedia.org/T389784) (owner: Btullis)
[10:43:25] (CR) CI reject: [V:-1] kubernetes: remove usage of prometheus_all_nodes, Prometheus has access by default [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[10:51:29] (PS4) Filippo Giunchedi: kubernetes: remove usage of prometheus_all_nodes [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170)
[10:51:36] (CR) Btullis: [C:+1] Readd an-worker1166-68 to cluster [puppet] - https://gerrit.wikimedia.org/r/1140559 (https://phabricator.wikimedia.org/T390170) (owner: Stevemunene)
[11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250502T0700)
[11:00:05] jelto, arnoldokoth, and mutante: #bothumor Q:Why did functions stop calling each other? A:They had arguments.
Rise for GitLab version upgrades . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250502T1100). [11:01:47] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1140664 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [11:02:29] (03CR) 10Btullis: [V:03+1 C:03+2] Update dumpsgen SSH settings on clouddumps servers. [puppet] - 10https://gerrit.wikimedia.org/r/1140664 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [11:13:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:23:14] (03PS1) 10Hnowlan: mw::maintenance: migrate one image suggestions job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140671 (https://phabricator.wikimedia.org/T388537) [11:23:16] (03PS1) 10Hnowlan: mw::maintenance: migrate all image suggestions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140672 (https://phabricator.wikimedia.org/T388537) [11:24:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10785702 (10fnegri) 05Open→03Resolved Thanks @Jclark-ctr, looking good: {F59619503} I'll mark this as Resolved, and let you know... 
[11:27:17] FIRING: [149x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:33:25] FIRING: SystemdUnitFailed: wmf_auto_restart_uwsgi-netbox.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:40:51] (03PS1) 10Stevemunene: hdfs: remove an-worker116[6-8] from hadoop worker role [puppet] - 10https://gerrit.wikimedia.org/r/1140677 (https://phabricator.wikimedia.org/T390170) [11:41:50] (03PS1) 10Btullis: Fix dumpsgen authorized_keys and remove chrootdirectory [puppet] - 10https://gerrit.wikimedia.org/r/1140678 (https://phabricator.wikimedia.org/T390738) [11:56:55] (03CR) 10Stevemunene: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1140678 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [11:57:46] (03CR) 10Kamila Součková: mw::maintenance: migrate all image suggestions jobs to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140672 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [11:58:49] (03CR) 10Kamila Součková: mw::maintenance: migrate one image suggestions job to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140671 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [12:01:27] FIRING: ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:06:27] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes 
(http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:19] (03CR) 10Jforrester: [C:03+1] tests: Move buildLogoHTML.php to tests/ alongside buildConfigCache.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140658 (owner: 10Krinkle) [12:07:58] (03CR) 10Jforrester: [C:03+1] "Aha, thank you, was being irritated by these locally. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140652 (owner: 10Krinkle) [12:08:41] (03CR) 10Jforrester: "Maybe getConfigGlobals should be replaced instead, as this is the original version? Eh." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [12:10:12] (03PS1) 10Ayounsi: gNMIc: collect optics status on Juniper [puppet] - 10https://gerrit.wikimedia.org/r/1140688 (https://phabricator.wikimedia.org/T388641) [12:11:54] (03CR) 10Krinkle: "Aye, I considered that, but since T169821 we no longer actually have a "cache" for it to "generate". In the next patch this will make more" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [12:13:15] (03CR) 10Ayounsi: "As data point, it's ~400 metrics on cr4-ulsfo (for 10 interfaces), ~600 without the `event-delete` transform." [puppet] - 10https://gerrit.wikimedia.org/r/1140688 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:13:55] (03PS2) 10Ayounsi: gNMIc: collect optics status on Juniper [puppet] - 10https://gerrit.wikimedia.org/r/1140688 (https://phabricator.wikimedia.org/T388641) [12:14:37] (03CR) 10Jforrester: "The file is literally called MWConfigCacheGenerator. :-) But fine." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [12:16:12] (03PS1) 10Ayounsi: Fastnetmon bump threshold_mbps to 8Gbps [puppet] - 10https://gerrit.wikimedia.org/r/1140689 (https://phabricator.wikimedia.org/T311005) [12:16:24] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140688 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:16:52] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140689 (https://phabricator.wikimedia.org/T311005) (owner: 10Ayounsi) [12:36:37] (03CR) 10Dreamy Jazz: mw::maintenance: migrate fixGlobalBlockWhitelist to k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [12:36:50] (03PS1) 10Majavah: P:docker::builder: Build a Trixie image [puppet] - 10https://gerrit.wikimedia.org/r/1140695 (https://phabricator.wikimedia.org/T393173) [12:38:20] (03CR) 10Dreamy Jazz: mw::maintenance: migrate fixGlobalBlockWhitelist to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [12:38:27] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#10785921 (10Marostegui) @Volans - can we resume this work? 
[12:43:24] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5438/co" [puppet] - 10https://gerrit.wikimedia.org/r/1140695 (https://phabricator.wikimedia.org/T393173) (owner: 10Majavah) [12:47:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2006.codfw.wmnet [12:51:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2006.codfw.wmnet [12:52:55] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10785955 (10Andrew) 05Open→03Resolved I'm satisfied that https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249 is an... [12:58:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install (4) aux-k8 in ops-eqiad - https://phabricator.wikimedia.org/T393053#10785966 (10akosiaris) [12:59:26] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install (4) aux-k8 in ops-codfw - https://phabricator.wikimedia.org/T393054#10785969 (10akosiaris) [13:00:19] (03CR) 10Ayounsi: netbox: add fetch_device_interfaces using GraphQL (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [13:02:22] PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:03:36] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:56] (03PS1) 10Alexandros Kosiaris: site.pp changes for aux-k8s-workers [puppet] - 10https://gerrit.wikimedia.org/r/1140701 (https://phabricator.wikimedia.org/T393053) [13:04:53] (03PS1) 10Zabe: Force priviledged users to not use a common one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140702 [13:05:04] (03PS1) 10Abijeet Patro: Mobile frequent languages entrypoint: Add dependency to sitemapper 
[extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140703 (https://phabricator.wikimedia.org/T393144) [13:06:26] (03PS2) 10Zabe: Force priviledged users to not use a common password [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140702 [13:07:12] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:07:14] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:11:24] (03PS3) 10Zabe: Enforce password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140702 [13:12:23] (03CR) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [13:12:30] (03PS11) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [13:13:21] (03PS1) 10Stevemunene: hdfs: onboard an-worker116[6-8] after setup [puppet] - 10https://gerrit.wikimedia.org/r/1140704 (https://phabricator.wikimedia.org/T390170) [13:17:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2007.codfw.wmnet [13:18:11] (03PS4) 10Krinkle: multiversion: Remove getMWConfigForCacheing() as identical to getConfigGlobals() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821) [13:18:11] (03PS3) 10Krinkle: tests: Move buildLogoHTML.php to tests/ alongside buildConfigCache.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140658 [13:18:11] (03PS1) 10Krinkle: multiversion: Separate wmf-config reading from actual Multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140707 (https://phabricator.wikimedia.org/T169821) [13:19:04] (03CR) 10CI reject: [V:04-1] multiversion: Separate wmf-config 
reading from actual Multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140707 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [13:20:09] (03PS2) 10Krinkle: multiversion: Separate wmf-config reading from actual Multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140707 (https://phabricator.wikimedia.org/T169821) [13:21:06] (03CR) 10CI reject: [V:04-1] multiversion: Separate wmf-config reading from actual Multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140707 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [13:21:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2007.codfw.wmnet [13:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:24:30] James_F: ! i have not looked in the mw config repo in a while, how do I see what wikis ArticlePlaceholder is enabled on these days? [13:24:39] https://github.com/search?q=repo%3Awikimedia%2Foperations-mediawiki-config%20ArticlePlaceholder&type=code isn't helping! but I was expecting something to appear there [13:25:38] !log invoked manual `garbagecollect`, Cassandra sessionstore — T390514 [13:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:50] (03CR) 10Btullis: [C:03+2] Fix dumpsgen authorized_keys and remove chrootdirectory [puppet] - 10https://gerrit.wikimedia.org/r/1140678 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [13:27:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Upgrade codfw E/F Juniper equipment to Junos 23.x - https://phabricator.wikimedia.org/T393001#10786078 (10Papaul) on lsw1-f1-codfw we are having this error ` ERROR: There is pending upgrade. 
upgrade_in_progress=appupgrade_stage ERROR:... [13:30:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Upgrade codfw E/F Juniper equipment to Junos 23.x - https://phabricator.wikimedia.org/T393001#10786092 (10ayounsi) Thx, I tried this, let's see if it helps: ` lsw1-f1-eqiad> request system software rollback localre: ----------... [13:31:31] (03PS4) 10Ayounsi: Account for non defined dict keys [homer/public] - 10https://gerrit.wikimedia.org/r/1122138 [13:37:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2008.wikimedia.org [13:40:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2008.wikimedia.org [13:45:16] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140711 [13:46:53] addshore: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/InitialiseSettings.php#9527 — bnwiki, cywiki, dagwiki, eowiki, etwiki, guwiki, htwiki, knwiki, kswiki, lvwiki, napwiki, nnwiki, orwiki, papwiki, sewiki, sqwiki, urwiki, plus some test wikis [13:48:03] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10786135 (10MatthewVernon) [to be clear I won't delete things from swift without a common... 
[13:49:04] (03CR) 10Jforrester: Enforce password policy (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140702 (owner: 10Zabe) [13:49:38] !log imported ruby-defaults 1:3.3~wmf13u1 to component/puppet7 for trixie-wikimedia T392790 [13:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:41] T392790: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790 [13:56:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10786159 (10Stevemunene) [13:58:09] (03PS1) 10Muehlenhoff: Extend package list to be installed from component/puppet7 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1140716 (https://phabricator.wikimedia.org/T392790) [13:58:17] 10SRE-swift-storage, 06Commons: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10786173 (10MatthewVernon) So, the first isn't in either swift cluster per `swift stat wikipedia-commons-local-pu... [14:01:50] (03CR) 10Btullis: [C:03+1] hdfs: remove an-worker116[6-8] from hadoop worker role [puppet] - 10https://gerrit.wikimedia.org/r/1140677 (https://phabricator.wikimedia.org/T390170) (owner: 10Stevemunene) [14:04:00] 10SRE-swift-storage, 06Commons: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10786196 (10MatthewVernon) Second isn't in either swift cluster per `swift stat wikipedia-commons-local-public.21... 
[14:06:14] (03PS3) 10Krinkle: multiversion: Separate wmf-config reading from actual Multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140707 (https://phabricator.wikimedia.org/T169821) [14:07:17] (03PS4) 10Krinkle: multiversion: Separate wmf-config reading from actual Multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140707 (https://phabricator.wikimedia.org/T169821) [14:09:02] 10SRE-swift-storage, 06Commons: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10786205 (10MatthewVernon) The third isn't in either swift cluster per `swift stat wikipedia-commons-local-public... [14:10:59] !log bking@localhost set search_codfw num_concurrent_incoming_recoveries from 20 back down to 4 after migration T391350 [14:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:02] T391350: Review/rollback all temporary changes from the OpenSearch migration - https://phabricator.wikimedia.org/T391350 [14:13:24] 10SRE-swift-storage, 06Commons: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10786212 (10MatthewVernon) The fourth is also not in either swift cluster per `swift stat wikipedia-commons-local... 
[14:15:15] 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10786214 (10jcrespo) [14:16:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:19:54] (03PS1) 10Bking: relforge: remove config prior to decommission [puppet] - 10https://gerrit.wikimedia.org/r/1140717 (https://phabricator.wikimedia.org/T390565) [14:20:16] (03PS4) 10Zabe: Enforce password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140702 [14:20:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140717 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [14:21:09] (03CR) 10Stevemunene: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1140537 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:21:39] * Krinkle editing/testing something ad-hoc on mwdebug1002.eqiad.wmnet [14:22:58] (03PS2) 10Bking: relforge: remove config prior to decommission [puppet] - 10https://gerrit.wikimedia.org/r/1140717 (https://phabricator.wikimedia.org/T390565) [14:23:09] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140717 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [14:23:45] (03CR) 10Zabe: Enforce password policy (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140702 (owner: 10Zabe) [14:23:51] (03PS1) 10Ssingh: wikimedia-dns.org: bump up TTL for TYPE65 record (60 to 600) [dns] - 10https://gerrit.wikimedia.org/r/1140719 [14:26:22] (03PS12) 10Ayounsi: wmf-netbox use core Homer GraphQL based fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [14:30:33] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM from a Prometheus' 
POV, thank you for mentioning the expected number of metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1140688 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [14:31:12] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10786232 (10MoritzMuehlenhoff) [14:31:37] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10786236 (10Eevans) >>! In T391903#10784611, @Jclark-ctr wrote: > Let me know what you would like to do i can remove drive you can reboot > > ` > Server shows 8 drives > > [ ... ] > ` That's weird; I wond... [14:32:42] (03PS6) 10Ayounsi: netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 [14:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:39] (03PS1) 10Stevemunene: superset-next: Adjust superset csp to allow image uploads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140720 (https://phabricator.wikimedia.org/T391692) [14:46:39] !log dancy@deploy1003 Installing scap version "4.159.0" for 2 host(s) [14:46:43] (03CR) 10Stevemunene: [C:03+2] hdfs: remove an-worker116[6-8] from hadoop worker role [puppet] - 10https://gerrit.wikimedia.org/r/1140677 (https://phabricator.wikimedia.org/T390170) (owner: 10Stevemunene) [14:48:28] !log dancy@deploy1003 Installation of scap version "4.159.0" completed for 2 hosts [14:49:31] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1140723 [14:54:58] (03CR) 10Hashar: [C:03+1] "Looks, good thank you very much :)" [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [14:55:37] !log stevemunene@cumin1002 START - Cookbook 
sre.hadoop.init-hadoop-workers for hosts an-worker1166.eqiad.wmnet [14:56:03] (03CR) 10Bking: [C:03+2] cirrussearch: add newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1140537 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:57:19] (03CR) 10Scott French: "Great, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [14:58:35] (03PS1) 10Máté Szabó: Set wgPHPSessionHandling to 'warn' on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) [14:58:43] (03PS9) 10FNegri: add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [14:59:41] PROBLEM - SSH on prometheus3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:59:57] PROBLEM - SSH on prometheus7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:00:20] (03CR) 10Nik Gkountas: Catalog ContentTranslation tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas) [15:00:20] (03PS4) 10Nik Gkountas: Catalog ContentTranslation tables [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) [15:00:31] RECOVERY - SSH on prometheus3003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:01:26] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1166.eqiad.wmnet [15:01:28] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: 
name=cirrussearch2076.codfw.wmnet|cirrussearch2080.codfw.wmnet|cirrussearch2081.codfw.wmnet|cirrussearch2083.codfw.wmnet|cirrussearch2084.codfw.wmnet|cirrussearch2092.codfw.wmnet|cirrussearch2093.codfw.wmnet|cirrussearch2100.codfw.wmnet|cirrussearch2106.codfw.wmnet|cirrussearch2108.codfw.wmnet [15:01:51] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1167.eqiad.wmnet [15:02:38] (03PS1) 10Herron: add dummy group for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/1140727 [15:02:42] FIRING: [8x] ProbeDown: Service prometheus1007:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:29] (03CR) 10Herron: [V:03+2 C:03+2] add dummy group for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/1140727 (owner: 10Herron) [15:03:35] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1167.eqiad.wmnet [15:03:41] PROBLEM - SSH on prometheus2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:03:43] FIRING: [11x] ProbeDown: Service prometheus1007:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:55] PROBLEM - SSH on prometheus2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:03:56] this doesn't look good [15:04:27] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1168.eqiad.wmnet [15:05:41] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:06:03] sukhe: I'll take a look [15:06:09] thanks godog [15:06:10] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1168.eqiad.wmnet [15:06:49] RECOVERY - SSH on prometheus7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:06:58] FIRING: [3x] SLOMetricAbsent: citoid-latency eqiad - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:07:08] (03CR) 10Scott French: [C:03+1] "TIL `migration_title` is a thing! Neat :)" [puppet] - 10https://gerrit.wikimedia.org/r/1139811 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [15:07:09] (03CR) 10Ayounsi: "This is ready for review and tested quite a bit !" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [15:07:11] yeah looks like at least prometheus@k8s on prometheus200[78] asploded in memory, cc sukhe herron [15:07:38] good times [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:42] FIRING: [12x] ProbeDown: Service prometheus1007:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:07:51] FIRING: SLOMetricAbsent: wdqs-availability magru - https://slo.wikimedia.org/?search=wdqs-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:07:53] godog: ah, note 7001 also probably? 
[15:07:57] that's the magru one [15:08:29] indeed [15:08:44] godog: are you bouncing the hosts? I have an extremely slow login console on prom2007 fwiw [15:09:15] herron: ok since you are in console already please do bounce the hosts [15:09:26] ok doing [15:09:57] PROBLEM - SSH on prometheus7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:10:00] thanks folks! [15:10:41] sure np, I'll look into 7001 [15:11:21] !log power cycling prometheus200[78] via rac [15:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:37] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:11:47] RECOVERY - SSH on prometheus7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:11:53] PROBLEM - Host prometheus2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:58] FIRING: [6x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:12:00] power cycle has been issued to both waiting for them to reboot [15:12:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10786409 (10Stevemunene) Hosts all successfully ran the init cookook and `fstab` was righfully pop... 
[15:12:42] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:45] FIRING: [2x] SLOMetricAbsent: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:14:22] herron godog sukhe we're working on https://phabricator.wikimedia.org/T393177 which (hopefully) should keep SSH responsive even when the rest of the machine is falling over. Not sure if it would help y'all's situation but feel free to subscribe if interested [15:14:27] PROBLEM - Host prometheus2007 is DOWN: PING CRITICAL - Packet loss = 100% [15:14:37] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:31] RECOVERY - SSH on prometheus2007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:15:33] RECOVERY - Host prometheus2007 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [15:15:39] inflatador: very nice! I've subscribed [15:15:41] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:15:47] RECOVERY - SSH on prometheus2008 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:15:49] RECOVERY - Host prometheus2008 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [15:16:21] inflatador: interesting thank you [15:16:22] inflatador: interesting! 
[15:16:40] (03CR) 10Gergő Tisza: [C:03+1] Set wgPHPSessionHandling to 'warn' on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [15:16:56] prometheus7001 came back on its own btw [15:17:10] yeah, definitely read that whole facebook article if you have time, really good practical explanation of cgroupsv2 [15:17:42] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:42] FIRING: [12x] ProbeDown: Service prometheus1007:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:17:45] FIRING: [2x] SLOMetricAbsent: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:20:49] (03PS1) 10Andrew Bogott: site.pp entry for cloudcontrol2010-dev [puppet] - 10https://gerrit.wikimedia.org/r/1140739 (https://phabricator.wikimedia.org/T393102) [15:22:45] RESOLVED: [2x] SLOMetricAbsent: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:23:21] (03CR) 10Andrew Bogott: [C:03+2] site.pp entry for cloudcontrol2010-dev [puppet] - 10https://gerrit.wikimedia.org/r/1140739 (https://phabricator.wikimedia.org/T393102) (owner: 10Andrew Bogott) [15:23:57] (03PS1) 10Ebernhardson: opensearch: Provide expected base_data_dir to readahead disable [puppet] - 10https://gerrit.wikimedia.org/r/1140741 [15:24:36] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcontrol2010-dev - 
https://phabricator.wikimedia.org/T393102#10786440 (10Andrew) a:05Andrew→03None I believe this host has a raid controller. Assuming that's correct, I'd like a raid10 combin... [15:24:57] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140741 (owner: 10Ebernhardson) [15:26:58] RESOLVED: [6x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:28:03] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10786461 (10MoritzMuehlenhoff) [15:28:06] (03PS2) 10Bking: opensearch: Provide expected base_data_dir to readahead disable [puppet] - 10https://gerrit.wikimedia.org/r/1140741 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [15:28:21] (03CR) 10Bking: [C:03+2] opensearch: Provide expected base_data_dir to readahead disable [puppet] - 10https://gerrit.wikimedia.org/r/1140741 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [15:28:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:51] FWIW re: prometheus those were some heavy queries from grafana, what looked like manual queries [15:29:58] i.e. 
not from a dashboard [15:30:11] (03PS1) 10Majavah: libraryupgrader: Update branch name [puppet] - 10https://gerrit.wikimedia.org/r/1140744 [15:30:26] (03CR) 10Btullis: [C:03+1] hdfs: onboard an-worker116[6-8] after setup [puppet] - 10https://gerrit.wikimedia.org/r/1140704 (https://phabricator.wikimedia.org/T390170) (owner: 10Stevemunene) [15:32:44] (03CR) 10Hnowlan: [C:03+1] CampaignEvents: Migrate aggregateparticipantanswers-testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1139811 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [15:33:25] FIRING: SystemdUnitFailed: wmf_auto_restart_uwsgi-netbox.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:46] (03CR) 10Stevemunene: [C:03+2] hdfs: onboard an-worker116[6-8] after setup [puppet] - 10https://gerrit.wikimedia.org/r/1140704 (https://phabricator.wikimedia.org/T390170) (owner: 10Stevemunene) [15:37:17] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:41:51] (03PS1) 10Robertsky: 
siwikitionary: update logo to localised svg version. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) [15:41:58] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install (4) aux-k8 in ops-codfw - https://phabricator.wikimedia.org/T393054#10786517 (10RobH) [15:42:17] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:37] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10786521 (10RobH) [15:42:57] (03PS1) 10Bking: relforge: bring new host into production [puppet] - 10https://gerrit.wikimedia.org/r/1140749 (https://phabricator.wikimedia.org/T393190) [15:43:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10786523 (10RobH) [15:43:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:43] James_F: many thanks, I think GitHub just doesn't index that file as it's too big :0 [15:44:07] (03PS2) 10Bking: relforge: bring new host into production [puppet] - 10https://gerrit.wikimedia.org/r/1140749 (https://phabricator.wikimedia.org/T393190) [15:44:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140749 (https://phabricator.wikimedia.org/T393190) (owner: 10Bking) [15:44:30] (03CR) 10Stevemunene: [C:03+2] Readd 
an-worker1166-68 to cluster [puppet] - 10https://gerrit.wikimedia.org/r/1140559 (https://phabricator.wikimedia.org/T390170) (owner: 10Stevemunene) [15:45:12] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1166.eqiad.wmnet [15:45:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10786537 (10ops-monitoring-bot) Host rebooted by stevemunene@cumin1002 with reason: Rebooting afte... [15:47:17] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:52:17] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:57:17] FIRING: [144x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:57:38] (03PS1) 10MVernon: swift: add ms-fe101[5,6] as new proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1140752 (https://phabricator.wikimedia.org/T388886) [15:58:43] FIRING: [135x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:13] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195 (10RobH) 03NEW [15:59:21] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10786590 (10RobH) [15:59:46] 06SRE, 10SRE-swift-storage, 10Ceph, 13Patch-For-Review: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10786592 (10MatthewVernon) [15:59:55] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10786594 (10RobH) a:03Marostegui Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. T... 
[16:00:09] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudlb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T392686#10786598 (10Andrew) + @cmooney because I bet he can fix this in 5 seconds [16:02:17] FIRING: [115x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:43] FIRING: [113x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:53] (03CR) 10Jcrespo: [C:03+1] swift: add ms-fe101[5,6] as new proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1140752 (https://phabricator.wikimedia.org/T388886) (owner: 10MVernon) [16:05:02] PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:06:27] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:07:17] FIRING: [90x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:23] (03CR) 10JHathaway: [C:03+1] Extend package list to be installed from component/puppet7 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1140716 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [16:17:44] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:23:32] PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:23:44] PROBLEM - Hadoop NodeManager on an-worker1206 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:24:02] RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:24:16] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-f1-codfw.mgmt.codfw.wmnet [16:24:18] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:24:38] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:25:44] RECOVERY - Hadoop NodeManager on an-worker1206 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:26:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:28:19] !log mvernon@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 18:00:00 on ms-fe1015.eqiad.wmnet with reason: not yet in prod [16:28:29] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10786701 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=60e63b14-6e88-4b5b-aa82-3177a7ab590b) set by mvernon@cumin1002 for 2 days,... [16:28:38] !log mvernon@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 18:00:00 on ms-fe1016.eqiad.wmnet with reason: not yet in prod [16:28:46] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10786703 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=aed9f089-6890-4797-9578-4420d20fd11c) set by mvernon@cumin1002 for 2 days,... 
[16:29:39] (03PS2) 10Andrea Denisse: grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) [16:30:35] (03PS3) 10Andrea Denisse: grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) [16:30:44] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:33:07] (03CR) 10Andrea Denisse: "Thanks for taking a look, now that I think more about it removing it from the catalog won't disable synchronization as that's already depl" [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [16:36:12] RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:38] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:56] PROBLEM - Hadoop NodeManager on an-worker1207 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:40:38] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:38] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:24] (03CR) 10JHathaway: [C:03+2] ferm: ignore hidden staged files created by confd [puppet] - 10https://gerrit.wikimedia.org/r/1139893 (owner: 10JHathaway) [16:47:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-f1-codfw.mgmt.codfw.wmnet [16:48:17] (03CR) 10BCornwall: [C:03+1] wikimedia-dns.org: bump up TTL for TYPE65 record (60 to 600) [dns] - 10https://gerrit.wikimedia.org/r/1140719 (owner: 10Ssingh) [16:48:32] RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:56] RECOVERY - Hadoop NodeManager on an-worker1207 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:04] (03CR) 10Bking: [C:03+2] relforge: bring new host into production [puppet] - 10https://gerrit.wikimedia.org/r/1140749 (https://phabricator.wikimedia.org/T393190) (owner: 10Bking) [16:51:16] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: bump up TTL for TYPE65 record (60 to 600) [dns] - 
10https://gerrit.wikimedia.org/r/1140719 (owner: 10Ssingh) [16:51:16] (03CR) 10Bking: [C:03+2] "self-merging, as this is a non-prod environment" [puppet] - 10https://gerrit.wikimedia.org/r/1140749 (https://phabricator.wikimedia.org/T393190) (owner: 10Bking) [16:51:26] !log sukhe@dns1004 START - running authdns-update [16:53:20] (03CR) 10Ssingh: [C:03+1] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1140262 (owner: 10Ncmonitor) [16:53:58] !log sukhe@dns1004 END - running authdns-update [16:56:27] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:57:24] 06SRE, 06Infrastructure-Foundations, 10netops: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10786790 (10jhathaway) We packaged `prometheus-ethtool-exporter` ourselves, so removing it from our stack would also remove a small maintenance burden, e.g. we haven't packaged i... 
[16:57:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [17:00:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:02:17] FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:03:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:03:43] FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:07:42] RESOLVED: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:07:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [17:09:59] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1166.eqiad.wmnet [17:10:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10786850 (10ops-monitoring-bot) Host rebooted by stevemunene@cumin1002 with reason: Rebooting after harddrives upgrade [17:10:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:13:43] FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:17:41] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1166.eqiad.wmnet [17:19:04] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1167.eqiad.wmnet [17:19:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10786875 (10ops-monitoring-bot) Host rebooted 
by stevemunene@cumin1002 with reason: Rebooting after harddrives upgrade [17:21:30] (03PS1) 10Xcollazo: Default smtp to localhost for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) [17:22:04] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo) [17:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:23:38] (03CR) 10CI reject: [V:04-1] Default smtp to localhost for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo) [17:26:45] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1167.eqiad.wmnet [17:27:12] (03PS2) 10Xcollazo: Default smtp to localhost for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) [17:27:21] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo) [17:27:28] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1168.eqiad.wmnet [17:27:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10786915 (10ops-monitoring-bot) Host rebooted by stevemunene@cumin1002 with reason: Rebooting after harddrives upgrade [17:28:43] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:31:51] (03PS1) 10Andrew Bogott: codfw1dev backups: enforce_policy_scope: true [puppet] - 10https://gerrit.wikimedia.org/r/1140767 (https://phabricator.wikimedia.org/T330759) [17:31:53] (03PS1) 10Andrew Bogott: eqiad1: enforce_new_policy_defaults: True [puppet] - 10https://gerrit.wikimedia.org/r/1140768 (https://phabricator.wikimedia.org/T330759) [17:35:24] (03CR) 10Majavah: [C:03+2] libraryupgrader: Update branch name [puppet] - 10https://gerrit.wikimedia.org/r/1140744 (owner: 10Majavah) [17:35:36] (03CR) 10Xcollazo: "PPC looks good to me, but I now noticed that there are `analytics` and `main` versions of all refine scripts, so I wonder if setting this " [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo) [17:35:38] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1168.eqiad.wmnet [17:36:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [17:37:00] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:37:54] (03CR) 10Bernard Wang: Stream registration for article summaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [17:38:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 
3 - rack F5) - https://phabricator.wikimedia.org/T390170#10786941 (10Stevemunene) Hosts are onboarded and visible from the datanode interface {F59627085} [17:39:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10786942 (10Stevemunene) [17:39:44] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10786946 (10Stevemunene) [17:40:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10786949 (10Stevemunene) a:03Stevemunene [17:41:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [17:42:26] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev backups: enforce_policy_scope: true [puppet] - 10https://gerrit.wikimedia.org/r/1140767 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [17:47:13] (03PS2) 10Scott French: hieradata: switch mw-script main release to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1137496 (https://phabricator.wikimedia.org/T391057) [17:47:13] (03CR) 10Scott French: "Thanks in advance for the review! 
I don't intend to merge these two patches until the target date next week (or later, if issues arise), s" [puppet] - 10https://gerrit.wikimedia.org/r/1137496 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [17:47:16] (03PS3) 10Scott French: deployment_server: drop unsupported fallback to PHP 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/1137497 (https://phabricator.wikimedia.org/T391057) [17:51:10] (03PS1) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [17:51:38] (03CR) 10CI reject: [V:04-1] cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [17:52:07] (03PS2) 10Andrew Bogott: eqiad1: enforce_new_policy_defaults: True [puppet] - 10https://gerrit.wikimedia.org/r/1140768 (https://phabricator.wikimedia.org/T330759) [17:52:11] (03PS2) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [17:52:15] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140768 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [17:52:23] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [17:53:20] (03CR) 10CI reject: [V:04-1] cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [17:53:27] (03PS1) 10Andrea Denisse: grafana: Toggle data sync using feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) [17:53:27] (03CR) 10Andrea Denisse: "Once the testing period is over and both hosts are using the same 
Grafana version I'll send a patch to re-enable sync on the hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [17:53:28] PROBLEM - Hadoop NodeManager on an-worker1173 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:54:00] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:54:06] PROBLEM - Hadoop NodeManager on an-worker1165 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:55:45] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205 (10ops-monitoring-bot) 03NEW [17:57:06] RECOVERY - Hadoop NodeManager on an-worker1165 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:03:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:04:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:06:06] 
(03CR) 10Jforrester: [C:03+1] "Let's not deploy this during the Hackathon, but great stuff." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [18:13:28] RECOVERY - Hadoop NodeManager on an-worker1173 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:18:20] (03PS1) 10Dwisehaupt: icinga: frack: adjust fran* groupings and add host [puppet] - 10https://gerrit.wikimedia.org/r/1140775 (https://phabricator.wikimedia.org/T386259) [18:22:39] (03PS3) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [18:23:08] (03CR) 10CI reject: [V:04-1] cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:28:44] (03PS4) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [18:32:02] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:35:15] (03PS5) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [18:35:43] (03CR) 10CI reject: [V:04-1] cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:38:08] (03PS6) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [18:38:39] 
(03CR) 10CI reject: [V:04-1] cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:40:11] (03PS7) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [18:43:56] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:46:12] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:47:25] (03CR) 10Dzahn: "re: "backing up gerrit only just 1 dc". That is the status quo and we considered that smarter. But in the recent incident we ended up in a" [puppet] - 10https://gerrit.wikimedia.org/r/1140506 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn) [18:47:41] (03PS8) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [18:48:06] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:03:00] RECOVERY - Disk space on analytics1071 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1071&var-datasource=eqiad+prometheus/ops [19:08:42] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:08:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:09:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:12:46] (03PS2) 10Herron: logs-api: add write/delete acl via htgroup [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) [19:12:46] (03CR) 10Herron: [V:03+1] "here's something to get the ball rolling on this one" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [19:13:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_uwsgi-netbox.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:28:42] (03CR) 10Cwhite: logs-api: add write/delete acl via htgroup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [19:32:07] (03CR) 10Herron: [V:03+1] logs-api: add write/delete acl via htgroup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [19:35:28] (03PS1) 10Ryan Kemper: cirrus: 
remove old elastic hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1140778 (https://phabricator.wikimedia.org/T388610) [19:36:26] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:38:37] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:41:35] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [19:41:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10787207 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm [19:42:25] (03CR) 10Andrew Bogott: "Just so I'm caught up (again) -- the issue here is that a cloudweb on a private network can't talk to proxy-eqiad1.wmflabs.org, but it /ca" [puppet] - 10https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453) (owner: 10Majavah) [19:44:13] (03CR) 10JHathaway: [C:03+1] Initial Puppet agent apt config for Puppet 7 in Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1140659 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [19:44:45] (03CR) 10Andrew Bogott: [C:03+1] "lgtm. Any trove backup things happening are probably somewhat redundant but I won't complain about extra backups." 
[puppet] - 10https://gerrit.wikimedia.org/r/1138241 (owner: 10Majavah) [19:45:58] (03CR) 10Andrew Bogott: [C:03+1] toolforge: toolviews: Drop support for tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1121630 (owner: 10Majavah) [19:47:00] (03CR) 10Andrew Bogott: [C:03+1] realm: stop setting labsproject [puppet] - 10https://gerrit.wikimedia.org/r/916425 (owner: 10Majavah) [19:50:43] (03CR) 10RLazarus: [C:03+1] hieradata: switch mw-script main release to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1137496 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [19:50:48] (03CR) 10RLazarus: [C:03+1] deployment_server: drop unsupported fallback to PHP 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/1137497 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [19:50:49] (03CR) 10Jgreen: [C:03+1] icinga: frack: adjust fran* groupings and add host [puppet] - 10https://gerrit.wikimedia.org/r/1140775 (https://phabricator.wikimedia.org/T386259) (owner: 10Dwisehaupt) [19:54:55] 10ops-esams, 06DC-Ops: Inbound errors on interface cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://phabricator.wikimedia.org/T393213 (10phaultfinder) 03NEW [19:55:36] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53942 bytes in 2.991 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:55:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:56:00] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:57:43] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:58:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:58:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:59:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:09:05] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:09:34] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:11:13] !log removed 1 file for legal compliance [20:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:04] (03PS1) 10Dr0ptp4kt: Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) [20:14:38] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:15:44] !log removed 1 file for legal compliance [20:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:03] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1054.mgmt.eqiad.wmnet with chassis 
set policy FORCE_RESTART [20:17:25] (03CR) 10Dr0ptp4kt: "@tchin@wikimedia.org is updating EventGate Wikimedia for https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-c64ca5jhxmnc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [20:18:39] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1054.eqiad.wmnet with OS bookworm [20:18:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10787300 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm [20:20:14] (03PS2) 10Dr0ptp4kt: Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) [20:23:16] !log removed 3 files for legal compliance [20:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:59] (03PS1) 10Dwisehaupt: frack: update A and PTR records for NAT mappings [dns] - 10https://gerrit.wikimedia.org/r/1140785 (https://phabricator.wikimedia.org/T392843) [20:24:36] (03CR) 10CI reject: [V:04-1] frack: update A and PTR records for NAT mappings [dns] - 10https://gerrit.wikimedia.org/r/1140785 (https://phabricator.wikimedia.org/T392843) (owner: 10Dwisehaupt) [20:27:38] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm [20:27:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10787335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed... 
[20:29:56] (03PS3) 10Dr0ptp4kt: Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) [20:29:58] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:30:39] (03CR) 10Dwisehaupt: [C:04-1] "ends up there are things scattered in this file. going to rethink and rearrange a bit." [dns] - 10https://gerrit.wikimedia.org/r/1140785 (https://phabricator.wikimedia.org/T392843) (owner: 10Dwisehaupt) [20:31:44] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:33:34] (03PS4) 10Dr0ptp4kt: Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) [20:34:05] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [20:34:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10787355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm [20:53:02] (03CR) 10Bking: [C:03+2] cirrus: remove old elastic hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1140778 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [20:58:49] (03PS5) 10Dr0ptp4kt: Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) [21:03:10] (03PS3) 10Bking: relforge: remove config prior to decommission [puppet] - 10https://gerrit.wikimedia.org/r/1140717 (https://phabricator.wikimedia.org/T390565) [21:05:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install 
ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10787399 (10VRiley-WMF) Currently attempting to image these servers, however, it seems it's not being detected after reboot. Will continue to investigate th... [21:12:16] (03PS2) 10Dwisehaupt: frack: update A and PTR records for NAT mappings [dns] - 10https://gerrit.wikimedia.org/r/1140785 (https://phabricator.wikimedia.org/T392843) [21:13:42] (03PS2) 10Dwisehaupt: frack: update A and PTR records for NAT mappings [dns] - 10https://gerrit.wikimedia.org/r/1140785 (https://phabricator.wikimedia.org/T392843) [21:13:42] (03CR) 10Dwisehaupt: "@jgreen@wikimedia.org Here is the change with the rearrangement. This pulls all of our A records to the one location. Also it brings the C" [dns] - 10https://gerrit.wikimedia.org/r/1140785 (https://phabricator.wikimedia.org/T392843) (owner: 10Dwisehaupt) [21:14:55] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): sessionstorage namespacing - https://phabricator.wikimedia.org/T392170#10787411 (10Eevans) [21:18:58] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:22:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:22:21] vriley@cumin1002 reimage (PID 1772954) is awaiting input [21:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:23:03] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm [21:23:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10787444 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed... [21:25:47] (03CR) 10Jgreen: [C:03+1] frack: update A and PTR records for NAT mappings [dns] - 10https://gerrit.wikimedia.org/r/1140785 (https://phabricator.wikimedia.org/T392843) (owner: 10Dwisehaupt) [21:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:38:54] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1054.eqiad.wmnet with OS bookworm [21:39:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10787485 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed... 
[21:40:53] 06SRE-OnFire, 10Cassandra, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10787491 (10Scott_French) Following up on how this alerting might evolve, there was some discussion in T392989 about how to mak... [21:52:41] (03PS1) 10JHathaway: postfix: add support for cfssl certs [puppet] - 10https://gerrit.wikimedia.org/r/1140791 (https://phabricator.wikimedia.org/T383715) [21:53:40] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140791 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [22:09:58] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:10:46] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 8.142 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:12:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.801 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:29:27] (03PS9) 10BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) [22:30:20] (03CR) 10BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [22:32:13] (03PS1) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1140795 [22:32:27] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140795 (owner: 10JHathaway) [22:37:41] (03PS2) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] 
- 10https://gerrit.wikimedia.org/r/1140795 [22:37:48] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140795 (owner: 10JHathaway) [22:39:35] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140795 (owner: 10JHathaway) [22:41:49] (03CR) 10Dwisehaupt: [C:03+1] "This looks good to me. I like the selection as a function since it keeps it clean. The PCC error is unrelated to the change, so not 100% s" [puppet] - 10https://gerrit.wikimedia.org/r/1140791 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [22:55:39] (03PS4) 10Novem Linguae: core-Permissions: refactor enwiki wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140661 [23:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:40:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1140797 [23:40:57] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1140797 (owner: 10TrainBranchBot) [23:52:48] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1140797 (owner: 10TrainBranchBot)