[00:33:17] SRE, Observability-Alerting, Traffic: Use DNS name instead of IP in PyBal alerts - https://phabricator.wikimedia.org/T322377 (BCornwall) Open→In progress a:BCornwall
[00:33:40] SRE, Observability-Alerting, Traffic: Use DNS name instead of IP in PyBal alerts - https://phabricator.wikimedia.org/T322377 (BCornwall) p:Triage→Low
[00:39:21] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/912392
[00:39:24] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/912392 (owner: TrainBranchBot)
[00:40:25] (PS1) BCornwall: pybal: Send service hostnames on alert [puppet] - https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377)
[00:42:43] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40946/console" [puppet] - https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: BCornwall)
[00:43:01] SRE, Observability-Alerting, Traffic, Patch-For-Review: Use DNS name instead of IP in PyBal alerts - https://phabricator.wikimedia.org/T322377 (BCornwall) @bking: My patch still keeps the IP addresses around since I feel that some information is better than no information in the case of DNS looku...
[00:46:29] (CR) BCornwall: [V: +1] "This was tested on my own copy on an lvs server by hardcoding different sets and running with ./check_pybal_ipvs_diff --req-timeout=2.0 --" [puppet] - https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: BCornwall)
[00:49:00] (PS1) BryanDavis: rebuild_all: clean before building too [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/913005
[00:49:42] (CR) BryanDavis: [C: +2] rebuild_all: clean before building too [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/913005 (owner: BryanDavis)
[00:50:16] (Merged) jenkins-bot: rebuild_all: clean before building too [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/913005 (owner: BryanDavis)
[00:52:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[00:57:06] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/912392 (owner: TrainBranchBot)
[02:09:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[02:24:35] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:30:29] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:52] (PS2) Catrope: beta: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/913002 (https://phabricator.wikimedia.org/T319064)
[02:40:36] (PS3) Catrope: beta: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/912417 (https://phabricator.wikimedia.org/T319064)
[02:40:37] (PS3) Catrope: beta: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/913002 (https://phabricator.wikimedia.org/T319064)
[02:40:40] (PS1) Catrope: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/913018 (https://phabricator.wikimedia.org/T319064)
[02:40:44] (PS1) Catrope: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/913019 (https://phabricator.wikimedia.org/T319064)
[02:41:23] (CR) Catrope: [C: -2] "Not ready yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/913018 (https://phabricator.wikimedia.org/T319064) (owner: Catrope)
[02:41:27] (CR) Catrope: [C: -2] "Not ready yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/913019 (https://phabricator.wikimedia.org/T319064) (owner: Catrope)
[03:00:22] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:03:15] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.1% free -
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:13:04] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:38] (PS4) Catrope: beta: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/912417 (https://phabricator.wikimedia.org/T319064)
[04:50:40] (PS4) Catrope: beta: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/913002 (https://phabricator.wikimedia.org/T319064)
[04:50:42] (PS2) Catrope: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/913018 (https://phabricator.wikimedia.org/T319064)
[04:50:44] (PS2) Catrope: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/913019 (https://phabricator.wikimedia.org/T319064)
[04:52:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[05:21:59] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (Marostegui)
[05:22:29] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (Marostegui)
[05:23:19] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman -
https://phabricator.wikimedia.org/T335529 (Marostegui) uid: jkieserman uidNumber: 35687
[05:24:28] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (Marostegui)
[05:26:24] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (Marostegui) @thcipriani we need your approval for the `restricted` group @JKieserman we need your manager to approve this request. I have also contacted you separately to verify...
[05:26:37] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (Marostegui) p:Triage→Medium
[05:27:37] SRE, SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (Marostegui) Thanks Daniel - got it and bookmarked it! :)
[05:27:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 393731
[05:28:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 393731
[05:29:55] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 112
[05:30:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 112
[05:47:00] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:50:34] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:50:57] SRE-tools, Infrastructure-Foundations, Traffic: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (ayounsi) My suggestion to use a cookbook is because it's what SREs are familiar with, centralized in one place, can be nested for larger scope automation, provide abstract...
[05:52:10] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:56:38] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:57:37] !log push pfw policies - T335554
[05:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230428T0600)
[06:15:45] (CR) Ayounsi: Apply black to all python files (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/907880 (owner: Ayounsi)
[06:20:07] (PS11) Ayounsi: Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[06:23:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[06:27:27] (CR) Daniel Kinzler: Enable parser cache warming jobs for parsoid on group 0 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) (owner: Daniel Kinzler)
[06:27:59] (PS1) Marostegui: db1118: Will be moved to m3 [puppet] - https://gerrit.wikimedia.org/r/913029
(https://phabricator.wikimedia.org/T335092)
[06:28:25] (CR) Marostegui: [C: +2] db1118: Will be moved to m3 [puppet] - https://gerrit.wikimedia.org/r/913029 (https://phabricator.wikimedia.org/T335092) (owner: Marostegui)
[06:46:54] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2001.codfw.wmnet with OS buster
[06:47:01] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-cache2001.codfw.wmnet with OS buster
[06:51:52] (CR) Klausman: [C: +1] ml-services: deploy ores-legacy on a separate istio gateway [deployment-charts] - https://gerrit.wikimedia.org/r/912856 (owner: Elukey)
[06:52:17] (CR) Klausman: [C: +1] fast-api: bump the istio ingress module version [deployment-charts] - https://gerrit.wikimedia.org/r/912855 (owner: Elukey)
[06:52:59] (CR) Klausman: [C: +1] modules: allow istio gateways to have more selectors [deployment-charts] - https://gerrit.wikimedia.org/r/912850 (owner: Elukey)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230428T0700)
[07:00:48] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage
[07:03:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.053% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:04:05] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage
[07:22:09] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2001.codfw.wmnet with OS buster
[07:22:16] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-cache2001.codfw.wmnet with OS buster completed: - ml-cache2001 (**PASS**) - Downtimed on Icinga/Alertm...
[07:22:19] (CR) Muehlenhoff: "FYI, on Bookworm (with the Puppet 5.5 backport) the following diff gets printed with every second puppet run (maybe it tries to revert and" [puppet] - https://gerrit.wikimedia.org/r/912803 (owner: Jbond)
[07:22:47] (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40947/console" [puppet] - https://gerrit.wikimedia.org/r/912881 (https://phabricator.wikimedia.org/T335504) (owner: EoghanGaffney)
[07:23:37] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2002.codfw.wmnet with OS buster
[07:23:46] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-cache2002.codfw.wmnet with OS buster
[07:25:21] (CR) Jelto: [V: +1 C: +1] "lgtm for next Tuesday" [puppet] - https://gerrit.wikimedia.org/r/912881 (https://phabricator.wikimedia.org/T335504) (owner: EoghanGaffney)
[07:26:01] (CR) Filippo Giunchedi: [C: +1] opensearch: add disable_security_plugin option [puppet] - https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[07:29:04] (CR) Jelto: [C: +1] "lgtm for next Tuesday" [dns] - https://gerrit.wikimedia.org/r/912972 (https://phabricator.wikimedia.org/T335504) (owner: EoghanGaffney)
[07:30:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[07:30:36] SRE, Infrastructure-Foundations, Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[07:33:05] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine
translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to trans
[07:33:05] CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[07:35:15] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[07:37:41] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage
[07:38:53] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200) https://wiki
[07:38:53] imedia.org/wiki/CX
[07:39:59] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[07:41:07] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage
[07:44:25] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[07:47:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[07:53:11] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40948/console" [puppet] -
https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: Krinkle)
[07:54:40] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: Krinkle)
[07:55:49] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2002.codfw.wmnet with OS buster
[07:55:55] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-cache2002.codfw.wmnet with OS buster completed: - ml-cache2002 (**PASS**) - Downtimed on Icinga/Alertm...
[07:57:51] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2003.codfw.wmnet with OS buster
[07:57:58] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-cache2003.codfw.wmnet with OS buster
[08:00:58] SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (JMeybohm) That sounds like it would not be blocking me currently from migrating away from tokens (htt...
[08:02:52] (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[08:07:17] (CR) JMeybohm: "Would you mind splitting this in two CR's with the first one copying 1.0.0 to 1.0.1 and the second one implementing the actual change?
Tha" [deployment-charts] - https://gerrit.wikimedia.org/r/912850 (owner: Elukey)
[08:11:51] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage
[08:12:14] (CR) David Caro: [C: +2] toolforge: add pingthing to prometheus [puppet] - https://gerrit.wikimedia.org/r/912844 (owner: David Caro)
[08:12:42] (PS1) Filippo Giunchedi: o11y: silence pint errors for missing thanos 'code' label [alerts] - https://gerrit.wikimedia.org/r/913107
[08:14:26] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage
[08:17:31] (CR) Filippo Giunchedi: [C: +2] o11y: silence pint errors for missing thanos 'code' label [alerts] - https://gerrit.wikimedia.org/r/913107 (owner: Filippo Giunchedi)
[08:18:26] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/912979 (https://phabricator.wikimedia.org/T335424) (owner: Herron)
[08:19:46] (PS1) Alexandros Kosiaris: machinetranslation: Bump limitranges and resourcequotas [deployment-charts] - https://gerrit.wikimedia.org/r/913108 (https://phabricator.wikimedia.org/T331505)
[08:19:48] (PS1) Alexandros Kosiaris: machinetranslation: Enable thanos-swift service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/913109 (https://phabricator.wikimedia.org/T331505)
[08:23:10] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[08:23:18] SRE, Infrastructure-Foundations, Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest...
[08:27:55] (PS1) Filippo Giunchedi: sre: let power supply issues open tasks [alerts] - https://gerrit.wikimedia.org/r/913110 (https://phabricator.wikimedia.org/T225140)
[08:29:32] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2003.codfw.wmnet with OS buster
[08:29:38] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-cache2003.codfw.wmnet with OS buster completed: - ml-cache2003 (**PASS**) - Downtimed on Icinga/Alertm...
[08:36:22] (CR) Cathal Mooney: [C: +1] "LGTM!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[08:41:21] (CR) Majavah: [C: -1] "Do we want to absent anything from the docker base images class?" [puppet] - https://gerrit.wikimedia.org/r/911331 (https://phabricator.wikimedia.org/T335282) (owner: Muehlenhoff)
[08:42:04] RECOVERY - Check systemd state on cloudbackup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:34] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (klausman)
[08:42:55] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (klausman) Open→Resolved All machines in codfw done.
[08:47:04] !log jnuche@deploy1002 Installing scap version "4.51.0" for 593 hosts
[08:48:13] (CR) Muehlenhoff: [C: +1] "Sorry for the delay, that got a little backlogged.
PCC (https://puppet-compiler.wmflabs.org/output/887943/40950/) looks fine, I'll merge t" [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[08:49:19] (PS1) JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268)
[08:49:24] (PS4) Elukey: modules: allow istio gateways to have more selectors (part 1) [deployment-charts] - https://gerrit.wikimedia.org/r/912850
[08:49:26] (PS4) Elukey: fast-api: bump the istio ingress module version [deployment-charts] - https://gerrit.wikimedia.org/r/912855
[08:49:28] (PS4) Elukey: ml-services: deploy ores-legacy on a separate istio gateway [deployment-charts] - https://gerrit.wikimedia.org/r/912856
[08:49:30] (PS1) Elukey: modules: allow istio gateways to have more selectors (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/913115
[08:50:36] (PS6) Muehlenhoff: apt::repository: use signed-by instead of apt-key [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[08:51:00] (CR) Elukey: modules: allow istio gateways to have more selectors (part 1) (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/912850 (owner: Elukey)
[08:52:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[08:52:03] (PS5) Elukey: modules: allow istio gateways to have more selectors (part 1) [deployment-charts] - https://gerrit.wikimedia.org/r/912850
[08:52:07] (PS2) Elukey: modules: allow istio gateways to have more selectors (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/913115
[08:52:11] (PS5) Elukey: fast-api:
bump the istio ingress module version [deployment-charts] - https://gerrit.wikimedia.org/r/912855
[08:52:15] (PS5) Elukey: ml-services: deploy ores-legacy on a separate istio gateway [deployment-charts] - https://gerrit.wikimedia.org/r/912856
[08:54:58] (PS1) Alexandros Kosiaris: machinetranslation: Enable ingress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/913116 (https://phabricator.wikimedia.org/T331505)
[08:55:34] (CR) JMeybohm: [C: +1] modules: allow istio gateways to have more selectors (part 1) (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/912850 (owner: Elukey)
[08:55:52] (CR) JMeybohm: [C: +1] modules: allow istio gateways to have more selectors (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/913115 (owner: Elukey)
[08:57:39] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2001.codfw.wmnet with OS bullseye
[08:57:47] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-cache2001.codfw.wmnet with OS bullseye
[08:58:38] (PS2) JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268)
[08:59:43] (CR) Elukey: [C: +2] modules: allow istio gateways to have more selectors (part 1) [deployment-charts] - https://gerrit.wikimedia.org/r/912850 (owner: Elukey)
[08:59:51] (CR) Elukey: [C: +2] modules: allow istio gateways to have more selectors (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/913115 (owner: Elukey)
[08:59:57] (PS3) Elukey: modules: allow istio gateways to have more selectors (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/913115
[09:00:01] (CR) CI reject: [V: -1] modules: allow istio gateways to have more selectors (part
2) [deployment-charts] - https://gerrit.wikimedia.org/r/913115 (owner: Elukey)
[09:00:35] (CR) Elukey: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/913115 (owner: Elukey)
[09:00:58] (PS6) Elukey: fast-api: bump the istio ingress module version [deployment-charts] - https://gerrit.wikimedia.org/r/912855
[09:06:38] (Merged) jenkins-bot: modules: allow istio gateways to have more selectors (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/913115 (owner: Elukey)
[09:06:51] (PS7) Elukey: fast-api: bump the istio ingress module version [deployment-charts] - https://gerrit.wikimedia.org/r/912855
[09:11:21] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage
[09:11:44] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40953/console" [puppet] - https://gerrit.wikimedia.org/r/912979 (https://phabricator.wikimedia.org/T335424) (owner: Herron)
[09:12:55] (CR) Elukey: [C: +1] kafkamon: transition to firewall definition [puppet] - https://gerrit.wikimedia.org/r/912979 (https://phabricator.wikimedia.org/T335424) (owner: Herron)
[09:13:56] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage
[09:15:48] (CR) Elukey: [C: +2] fast-api: bump the istio ingress module version [deployment-charts] - https://gerrit.wikimedia.org/r/912855 (owner: Elukey)
[09:16:02] (CR) Elukey: [C: +2] ml-services: deploy ores-legacy on a separate istio gateway [deployment-charts] - https://gerrit.wikimedia.org/r/912856 (owner: Elukey)
[09:19:04] (PS3) JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268)
[09:24:32] (PS1) Majavah: P:toolforge::prometheus: set team: wmcs on
alerts [puppet] - https://gerrit.wikimedia.org/r/913117 (https://phabricator.wikimedia.org/T334866)
[09:24:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:25:53] (PS1) JMeybohm: prometheus::k8s: Add missing client tokens [labs/private] - https://gerrit.wikimedia.org/r/913118 (https://phabricator.wikimedia.org/T325268)
[09:26:01] (PS1) Majavah: P:toolforge::prometheus: redirect webroot properly [puppet] - https://gerrit.wikimedia.org/r/913119
[09:26:24] (CR) JMeybohm: [V: +2 C: +2] prometheus::k8s: Add missing client tokens [labs/private] - https://gerrit.wikimedia.org/r/913118 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[09:28:06] (CR) David Caro: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/913117 (https://phabricator.wikimedia.org/T334866) (owner: Majavah)
[09:28:14] (CR) David Caro: [C: +2] P:toolforge::prometheus: set team: wmcs on alerts [puppet] - https://gerrit.wikimedia.org/r/913117 (https://phabricator.wikimedia.org/T334866) (owner: Majavah)
[09:30:19] (CR) Jbond: [C: +1] Remove unused role and profile for wmcs project- and home- nfs servers [puppet] - https://gerrit.wikimedia.org/r/911424 (https://phabricator.wikimedia.org/T333477) (owner: Andrew Bogott)
[09:31:21] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2001.codfw.wmnet with OS bullseye
[09:31:27] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-cache2001.codfw.wmnet with OS bullseye completed: - ml-cache2001 (**PASS**) - Downtimed on Icinga/Aler...
[09:32:05] (CR) David Caro: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/913119 (owner: Majavah)
[09:33:41] (PS4) JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268)
[09:34:43] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/912365 (https://phabricator.wikimedia.org/T263797) (owner: BCornwall)
[09:34:47] (PS1) Muehlenhoff: Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495)
[09:35:03] (CR) Jbond: [C: +1] Add component/cassandra41 for Cassandra 4.1.x releases [puppet] - https://gerrit.wikimedia.org/r/912376 (https://phabricator.wikimedia.org/T313814) (owner: Eevans)
[09:35:16] (CR) CI reject: [V: -1] Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[09:35:47] (PS1) Majavah: P:wmcs::metricsinfra: install blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/913122 (https://phabricator.wikimedia.org/T288067)
[09:36:50] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/912928 (https://phabricator.wikimedia.org/T335522) (owner: Herron)
[09:39:03] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40956/console" [puppet] - https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[09:42:04] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2002.codfw.wmnet with OS bullseye
[09:42:10] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was
started by klausman@cumin2002 for host ml-cache2002.codfw.wmnet with OS bullseye
[09:47:15] (PS2) Muehlenhoff: Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495)
[09:47:46] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service,ferm.service,prometheus-nic-firmware-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:22] (PS1) Vgutierrez: hiera: Disable http->https in varnish on cp5017,cp5025 [puppet] - https://gerrit.wikimedia.org/r/913123 (https://phabricator.wikimedia.org/T322774)
[09:48:31] (CR) Jbond: [C: +1] distros: add bookworm-wikimedia to known distros [puppet] - https://gerrit.wikimedia.org/r/912931 (owner: Volans)
[09:49:27] (CR) CI reject: [V: -1] Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[09:50:08] (CR) FNegri: [C: +2] d/changelog: Prepare for 0.94 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/912885 (https://phabricator.wikimedia.org/T331336) (owner: FNegri)
[09:50:25] (CR) Vgutierrez: [C: +2] hiera: Disable http->https in varnish on cp5017,cp5025 [puppet] - https://gerrit.wikimedia.org/r/913123 (https://phabricator.wikimedia.org/T322774) (owner: Vgutierrez)
[09:51:51] (Merged) jenkins-bot: d/changelog: Prepare for 0.94 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/912885 (https://phabricator.wikimedia.org/T331336) (owner: FNegri)
[09:55:04] PROBLEM - Check whether ferm is active by checking the default input chain on sretest1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:55:53] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage
[09:55:58] (PS5) JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268)
[09:58:33] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage
[09:59:51] (CR) Ladsgroup: Enable parser cache warming jobs for parsoid on group 0 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) (owner: Daniel Kinzler)
[10:01:22] !log restarting varnish on cp5017 and cp5025 to drop port 80 - T322774
[10:01:23] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40957/console" [puppet] - https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[10:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:40] (PS1) Vgutierrez: hiera: Enable http->https in haproxy on cp5017,cp5025 [puppet] - https://gerrit.wikimedia.org/r/913126 (https://phabricator.wikimedia.org/T322774)
[10:03:29] (CR) Vgutierrez: [C: +2] hiera: Enable http->https in haproxy on cp5017,cp5025 [puppet] - https://gerrit.wikimedia.org/r/913126 (https://phabricator.wikimedia.org/T322774) (owner: Vgutierrez)
[10:06:53] (PS1) Elukey: ml-services: enable mesh for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/913128
[10:09:38] (PS2) Elukey: ml-services: enable mesh for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/913128
[10:11:24] (CR) Muehlenhoff: [C: +1] "Good idea, let's give that a shot."
[software/spicerack] - https://gerrit.wikimedia.org/r/912928 (https://phabricator.wikimedia.org/T335522) (owner: Herron)
[10:11:38] (PS3) Muehlenhoff: Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495)
[10:11:50] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2003.codfw.wmnet with OS bullseye
[10:11:56] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-cache2003.codfw.wmnet with OS bullseye
[10:13:26] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2002.codfw.wmnet with OS bullseye
[10:13:32] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-cache2002.codfw.wmnet with OS bullseye completed: - ml-cache2002 (**PASS**) - Downtimed on Icinga/Aler...
[10:13:54] (CR) CI reject: [V: -1] Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[10:16:35] (PS4) Muehlenhoff: Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495)
[10:22:23] (CR) Elukey: [C: +2] ml-services: enable mesh for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/913128 (owner: Elukey)
[10:23:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[10:25:07] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[10:25:50] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage
[10:25:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:27:48] (PS1) Muehlenhoff: Use signed-by to in apt::package_from_component on Bookworm [puppet] - https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495)
[10:28:25] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage
[10:31:41] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[10:39:52] (CR) Muehlenhoff: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[10:43:49] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2003.codfw.wmnet with OS bullseye
[10:43:55] SRE, Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-cache2003.codfw.wmnet with OS bullseye completed: - ml-cache2003 (**PASS**) - Downtimed on Icinga/Aler...
[10:46:02] (PS2) Daniel Kinzler: Enable parser cache warming jobs for parsoid on g small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366)
[10:46:21] (PS3) Daniel Kinzler: Enable parser cache warming jobs for parsoid on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366)
[10:51:19] SRE, DBA, observability, MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), Patch-For-Review: Send metrics of db errors of mediawiki to prometheus - https://phabricator.wikimedia.org/T297435 (jcrespo) {F36966681} This is what I believe is a better graph from the same data with the instant (1 minute...
[10:54:20] (CR) Jbond: [V: +1 C: +2] vendor_modules: update augeasproviders_core to 3.2.1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/912803 (owner: Jbond)
[10:56:22] (PS1) Jbond: vendor_modules: add magic comment [puppet] - https://gerrit.wikimedia.org/r/913134 (https://phabricator.wikimedia.org/T335572)
[10:57:05] Puppet, Infrastructure-Foundations, Patch-For-Review: puppet: augeas and augeasprovider both define the augeas feature - https://phabricator.wikimedia.org/T335572 (jbond)
[10:57:38] (PS1) Vgutierrez: aptrepo: Add thirdparty/haproxy27 [puppet] - https://gerrit.wikimedia.org/r/913135
[10:59:32] (CR) Jbond: [C: +2] vendor_modules: add magic comment [puppet] - https://gerrit.wikimedia.org/r/913134 (https://phabricator.wikimedia.org/T335572) (owner: Jbond)
[11:01:12] (CR) Vgutierrez: "https://haproxy.debian.net/#distribution=Debian&release=bullseye&version=2.7 can be used to validate this CR" [puppet] - https://gerrit.wikimedia.org/r/913135 (owner: Vgutierrez)
[11:03:28] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.958% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:07:01] SRE-swift-storage, Thumbor, Platform Team Workboards (Platform Engineering Reliability), SVG: SVG rasterizer renders non Latin text as tofu glyph randomly (as thumbor-k8s lack noto fonts) - https://phabricator.wikimedia.org/T335271 (hnowlan) These images have been purged. Closing for now, please...
[11:07:19] SRE-swift-storage, Thumbor, Platform Team Workboards (Platform Engineering Reliability), SVG: SVG rasterizer renders non Latin text as tofu glyph randomly (as thumbor-k8s lack noto fonts) - https://phabricator.wikimedia.org/T335271 (hnowlan) Open→Resolved a:hnowlan
[11:08:16] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/913135 (owner: Vgutierrez)
[11:15:31] (CR) Muehlenhoff: vendor_modules: update augeasproviders_core to 3.2.1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/912803 (owner: Jbond)
[11:30:46] (CR) Muehlenhoff: [C: +2] "Thanks for this. I'll update the deb package next week." [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/893708 (owner: Lucas Werkmeister (WMDE))
[11:30:49] (CR) Muehlenhoff: [V: +2 C: +2] wmf-update-known-hosts-production: Automatically download DNS [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/893708 (owner: Lucas Werkmeister (WMDE))
[11:35:32] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (fgiunchedi)
[11:48:06] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (Jclark-ctr)
[11:48:38] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Jclark-ctr)
[11:55:49] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (Jclark-ctr)
[11:57:06] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (Jclark-ctr) sretest1003 A4 U31 PORT 45 CABLEID 230304500290
[11:57:36] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install
sretest1002 - https://phabricator.wikimedia.org/T334393 (Jclark-ctr)
[11:58:54] (PS1) JMeybohm: prometheus::k8s switch staging-codfw to client cert auth [puppet] - https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268)
[11:59:39] (CR) CI reject: [V: -1] prometheus::k8s switch staging-codfw to client cert auth [puppet] - https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[12:01:30] (CR) Muehlenhoff: [C: +1] "Looks good, one comment inline" [puppet] - https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: Krinkle)
[12:02:54] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff)
[12:03:02] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff) p:Triage→Medium
[12:03:25] (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40958/console" [puppet] - https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[12:04:02] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[12:05:03] (PS2) Alexandros Kosiaris: services_proxy: Comment port re-use [puppet] - https://gerrit.wikimedia.org/r/912244
[12:05:05] (PS3) Alexandros Kosiaris: services_proxy: Add machinetranslation [puppet] - https://gerrit.wikimedia.org/r/911887 (https://phabricator.wikimedia.org/T331505)
[12:05:07] (PS1) Alexandros Kosiaris: service::catalog: Add machinetranslation service [puppet] - https://gerrit.wikimedia.org/r/913152 (https://phabricator.wikimedia.org/T331505)
[12:05:38] (PS6) JMeybohm: prometheus::k8s: Use
kubernetes::clusters_defaults [puppet] - https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268)
[12:05:40] (PS2) JMeybohm: prometheus::k8s switch staging-codfw to client cert auth [puppet] - https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268)
[12:06:25] (PS1) Muehlenhoff: Fix docker-reporter config for legacy images [puppet] - https://gerrit.wikimedia.org/r/913153 (https://phabricator.wikimedia.org/T335282)
[12:06:26] !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[12:06:43] (CR) jenkins-bot: prometheus::k8s switch staging-codfw to client cert auth [puppet] - https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[12:08:28] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new server sretest1003 - jclark@cumin1001"
[12:08:42] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (Jclark-ctr)
[12:11:02] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40959/console" [puppet] - https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[12:11:05] (CR) Alexandros Kosiaris: [C: +2] services_proxy: Comment port re-use [puppet] - https://gerrit.wikimedia.org/r/912244 (owner: Alexandros Kosiaris)
[12:14:46] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (Marostegui)
[12:15:12] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (Marostegui) ssh key verified and checked against WMCS.
Still pending the approvals listed at T335529#8812573
[12:16:16] (CR) Alexandros Kosiaris: [C: +2] machinetranslation: Bump limitranges and resourcequotas [deployment-charts] - https://gerrit.wikimedia.org/r/913108 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris)
[12:16:26] (CR) Alexandros Kosiaris: [C: +2] machinetranslation: Enable thanos-swift service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/913109 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris)
[12:16:37] (CR) Alexandros Kosiaris: [C: +2] machinetranslation: Enable ingress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/913116 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris)
[12:18:37] (CR) Jbond: [C: +1] "lgtm" [alerts] - https://gerrit.wikimedia.org/r/913110 (https://phabricator.wikimedia.org/T225140) (owner: Filippo Giunchedi)
[12:23:10] (Merged) jenkins-bot: machinetranslation: Bump limitranges and resourcequotas [deployment-charts] - https://gerrit.wikimedia.org/r/913108 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris)
[12:23:12] (Merged) jenkins-bot: machinetranslation: Enable thanos-swift service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/913109 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris)
[12:24:02] (Merged) jenkins-bot: machinetranslation: Enable ingress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/913116 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris)
[12:25:03] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (Marostegui)
[12:29:05] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[12:29:18] !log akosiaris@deploy1002
helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[12:29:34] (CR) Jbond: [C: +1] Use signed-by to in apt::package_from_component on Bookworm [puppet] - https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[12:29:47] !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[12:30:17] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:30:28] Puppet, Infrastructure-Foundations, Patch-For-Review: puppet: augeas and augeasprovider both define the augeas feature - https://phabricator.wikimedia.org/T335572 (jbond) Open→Stalled The patch has fixed the issues for now, ill wait to see what upstream says
[12:31:20] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:34:30] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:35:29] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:35:47] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[12:36:08] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[12:37:50] (PS3) JMeybohm: prometheus::k8s switch staging-codfw to client cert auth [puppet] - https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268)
[12:41:30] (CR) Alexandros Kosiaris: [C: -1] prometheus::k8s switch staging-codfw to client cert auth (1 comment) [puppet] - https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[12:42:29] (CR) Alexandros Kosiaris: [C: -1] prometheus::k8s switch staging-codfw to client cert auth (1 comment) [puppet] - https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[12:43:02] SRE-swift-storage, Thumbor, Platform Team Workboards (Platform Engineering Reliability), SVG: SVG rasterizer renders non Latin text as tofu glyph randomly (as thumbor-k8s lack noto fonts) - https://phabricator.wikimedia.org/T335271 (TheDJ) FYI, SVG authors consider this to be the official list of...
[12:43:23] akosiaris: sorry, have not marked as WIP :/
[12:47:29] (CR) Jaime Nuche: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/912853 (https://phabricator.wikimedia.org/T335354) (owner: Muehlenhoff)
[12:50:03] SRE-swift-storage, Thumbor, Platform Team Workboards (Platform Engineering Reliability), SVG: SVG rasterizer renders non Latin text as tofu glyph randomly (as thumbor-k8s lack noto fonts) - https://phabricator.wikimedia.org/T335271 (akosiaris) >>! In T335271#8813201, @TheDJ wrote: > FYI, SVG auth...
[12:52:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[13:01:29] (CR) Vgutierrez: [C: +2] aptrepo: Add thirdparty/haproxy27 [puppet] - https://gerrit.wikimedia.org/r/913135 (owner: Vgutierrez)
[13:06:44] (PS1) ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232)
[13:07:12] (CR) CI reject: [V: -1] [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[13:08:36] (CR) Alexandros Kosiaris: wikifunctions: Add AppArmor profile usage (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: Alexandros Kosiaris)
[13:21:50] !log import haproxy 2.7.7 on apt.wm.o thirdparty/haproxy27 for bullseye
[13:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:06] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff)
[13:28:09] (CR) Herron: [C: +1] opensearch: add disable_security_plugin option (1 comment) [puppet] - https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[13:29:49] (PS2) Slyngshede: Requisition approval functionality. [software/bitu] - https://gerrit.wikimedia.org/r/911249
[13:36:17] (CR) Herron: [C: +2] "thanks for the reviews!"
[software/spicerack] - https://gerrit.wikimedia.org/r/912928 (https://phabricator.wikimedia.org/T335522) (owner: Herron)
[13:39:27] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:40:07] (Merged) jenkins-bot: ganeti: enable --no-wait-for-sync by default [software/spicerack] - https://gerrit.wikimedia.org/r/912928 (https://phabricator.wikimedia.org/T335522) (owner: Herron)
[13:40:26] (PS1) Vgutierrez: haproxy: Add haproxy 2.7 to the list of versions [puppet] - https://gerrit.wikimedia.org/r/913186
[13:44:15] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:09:26] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40964/console" [puppet] - https://gerrit.wikimedia.org/r/913186 (owner: Vgutierrez)
[14:09:51] (PS8) Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759)
[14:10:44] (CR) Vgutierrez: [V: +1 C: +2] haproxy: Add haproxy 2.7 to the list of versions [puppet] - https://gerrit.wikimedia.org/r/913186 (owner: Vgutierrez)
[14:22:22] (PS1) Andrea Denisse: prometheus: Failover DNS from prometheus3001 to prometheus3002 in esams [dns] - https://gerrit.wikimedia.org/r/913192 (https://phabricator.wikimedia.org/T309979)
[14:22:54] (CR) David Caro: OpenStack: add a clouds.yaml file for environment setup (2 comments) [puppet] - https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[14:23:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[14:24:31] (CR) JMeybohm: [V: +1] "This change is ready for review." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[14:33:06] (PS1) Andrea Denisse: prometheus: Failover DNS from prometheus4001 to prometheus4002 in ulsfo [dns] - https://gerrit.wikimedia.org/r/913194 (https://phabricator.wikimedia.org/T309979)
[14:35:41] (PS1) Elukey: fastapi-app: upgrade the chart after another run of sextant [deployment-charts] - https://gerrit.wikimedia.org/r/913195
[14:35:58] (PS1) Andrea Denisse: prometheus: Failover DNS from prometheus5001 to prometheus5002 in eqsin [dns] - https://gerrit.wikimedia.org/r/913196 (https://phabricator.wikimedia.org/T309979)
[14:37:05] (PS1) Elukey: ml-services: add mesh public port to ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/913197
[14:37:19] (PS2) ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232)
[14:37:30] (Abandoned) Elukey: ml-services: enable mesh for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/910521 (owner: Elukey)
[14:37:46] (PS1) Andrea Denisse: prometheus: Failover DNS from prometheus6001 to prometheus6002 in drmrs [dns] - https://gerrit.wikimedia.org/r/913198 (https://phabricator.wikimedia.org/T309979)
[14:37:50] (CR) CI reject: [V: -1] [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[14:38:32] (PS13) Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015)
[14:38:44] (CR) Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site (3 comments) [puppet] - https://gerrit.wikimedia.org/r/910856
(https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[14:38:51] (PS9) Krinkle: coal: Uninstall from webperf role and start decom [puppet] - https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242)
[14:39:21] (CR) CI reject: [V: -1] webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[14:39:54] (CR) CI reject: [V: -1] coal: Uninstall from webperf role and start decom [puppet] - https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: Krinkle)
[14:40:14] (PS14) Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015)
[14:40:20] (PS10) Krinkle: coal: Uninstall from webperf role and start decom [puppet] - https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242)
[14:40:32] (PS4) Hnowlan: changeprop: make num_workers configurable for jobqueue [deployment-charts] - https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196)
[14:43:11] (PS1) EoghanGaffney: [gitlab/runner] Add basic pool/depool commands [puppet] - https://gerrit.wikimedia.org/r/913199
[14:45:29] (CR) CI reject: [V: -1] [gitlab/runner] Add basic pool/depool commands [puppet] - https://gerrit.wikimedia.org/r/913199 (owner: EoghanGaffney)
[14:45:45] (CR) EoghanGaffney: [V: +1 C: +2] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40965/console" [puppet] - https://gerrit.wikimedia.org/r/908927 (https://phabricator.wikimedia.org/T334736) (owner: EoghanGaffney)
[14:50:50] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1004']
[14:51:43] !log pt1979@cumin2002 END (PASS) - Cookbook
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['stat1004']
[14:53:31] (CR) BCornwall: [V: +1 C: +2] wmflib: Add Maglev Hashing (mh) to supported types [puppet] - https://gerrit.wikimedia.org/r/912365 (https://phabricator.wikimedia.org/T263797) (owner: BCornwall)
[14:55:33] SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (dancy) >>! In T288629#8812688, @JMeybohm wrote: > That sounds like it would not be blocking me curren...
[14:57:11] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1004']
[14:57:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['stat1004']
[14:58:30] (PS2) Elukey: ml-services: add mesh public port to ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/913197
[15:01:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 2519
[15:02:20] (PS1) Andrew Bogott: nfs-exportd: Don't crash out if a dns lookup fails [puppet] - https://gerrit.wikimedia.org/r/913200 (https://phabricator.wikimedia.org/T335336)
[15:04:03] (PS2) Andrew Bogott: nfs-exportd: Don't crash out if a dns lookup fails [puppet] - https://gerrit.wikimedia.org/r/913200 (https://phabricator.wikimedia.org/T335336)
[15:06:12] (CR) Filippo Giunchedi: "Looks good -- only the now-unused config template to be removed, other than that we're good to go" [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[15:06:48] (PS15) Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015)
[15:07:03] (CR) Krinkle: "Re-tested on beta cluster as well,
all good there." [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[15:07:13] (CR) CI reject: [V: -1] webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[15:07:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 2519
[15:08:04] (PS16) Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015)
[15:08:12] (PS11) Krinkle: coal: Uninstall from webperf role and start decom [puppet] - https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242)
[15:08:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.863% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:09:38] (CR) Ahmon Dancy: [C: +1] deployment_server: Create k8s configs with pki certs [puppet] - https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[15:16:02] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (thcipriani) >>! In T335529#8812573, @Marostegui wrote: > @thcipriani we need your approval for the `restricted` group Approved
[15:16:10] (CR) Ayounsi: "Is there a risk of double tasks in cases like https://phabricator.wikimedia.org/F36962560 ? One for Status, the other for PS Redundancy."
[alerts] - https://gerrit.wikimedia.org/r/913110 (https://phabricator.wikimedia.org/T225140) (owner: Filippo Giunchedi)
[15:17:39] (CR) Krinkle: coal: Uninstall from webperf role and start decom (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: Krinkle)
[15:18:35] SRE-tools, Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (Papaul) @jbond as for 10 Mars 2023 the IDRAC latest version for PowerEdge R430 is 2.84 or to be able to run the firmware cookbook Redfish wants the idrac to be at minimum 3.30 . So I thi...
[15:19:10] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (Marostegui)
[15:23:28] !log update schema for backup1-codfw (mediabackups) T327157
[15:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:33] T327157: Create and deploy the logic to generate incremental backups of MediaWiki media files, to keep its file storage backup up to date, automatically - https://phabricator.wikimedia.org/T327157
[15:23:38] (PS2) Elukey: fastapi-app: upgrade the chart after another run of sextant [deployment-charts] - https://gerrit.wikimedia.org/r/913195
[15:23:40] (PS3) Elukey: ml-services: add mesh public port to ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/913197
[15:25:40] (CR) Ilias Sarantopoulos: [C: +1] "Added a comment, otherwise looks fine!" [deployment-charts] - https://gerrit.wikimedia.org/r/913195 (owner: Elukey)
[15:25:52] (CR) Elukey: "I have re-ran sextant to generate the fastapi-app chart, and it fixed a lot of things.
The fixes are all in https://gerrit.wikimedia.org/r" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885281 (https://phabricator.wikimedia.org/T292818) (owner: 10Giuseppe Lavagetto) [15:27:09] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [15:31:03] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:31:58] (03CR) 10Elukey: [C: 03+2] fastapi-app: upgrade the chart after another run of sextant (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/913195 (owner: 10Elukey) [15:33:16] (03CR) 10Elukey: [C: 03+2] ml-services: add mesh public port to ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/913197 (owner: 10Elukey) [15:35:47] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:38:57] (03CR) 10Cwhite: [C: 03+2] opensearch: add disable_security_plugin option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [15:39:18] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[15:54:40] (03PS1) 10Jcrespo: Prepare for 0.1.7 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/913234 [15:55:14] (03CR) 10Jcrespo: [C: 03+2] Update sql to add newly history table file_history [software/mediabackups] - 10https://gerrit.wikimedia.org/r/892891 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [15:55:31] (03CR) 10Jcrespo: [C: 03+2] backup_update: Fix logging on successful file history update [software/mediabackups] - 10https://gerrit.wikimedia.org/r/902672 (owner: 10Jcrespo) [15:55:40] (03CR) 10CI reject: [V: 04-1] Prepare for 0.1.7 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/913234 (owner: 10Jcrespo) [15:55:52] (03PS2) 10Jcrespo: backup_update: Fix logging on successful file history update [software/mediabackups] - 10https://gerrit.wikimedia.org/r/902672 [15:56:04] (03CR) 10Jcrespo: [C: 03+2] Create new script to read recent logs and update backups metadata [software/mediabackups] - 10https://gerrit.wikimedia.org/r/911943 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [15:56:24] (03PS2) 10Jcrespo: Create new script to read recent logs and update backups metadata [software/mediabackups] - 10https://gerrit.wikimedia.org/r/911943 (https://phabricator.wikimedia.org/T327157) [15:56:37] (03CR) 10Jcrespo: [V: 03+2] Create new script to read recent logs and update backups metadata [software/mediabackups] - 10https://gerrit.wikimedia.org/r/911943 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [15:56:58] (03PS3) 10Jcrespo: Update indexes for latest queries needed for mediabackups [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912818 (https://phabricator.wikimedia.org/T327157) [15:57:15] (03PS2) 10Jcrespo: Add functionality to detect last uploaded time for backup start [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912819 (https://phabricator.wikimedia.org/T327157) [15:57:26] (03PS3) 10Jcrespo: recentuploads: Set custom headers for querying the 
mediawiki api [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912935 (https://phabricator.wikimedia.org/T327157) [15:57:41] (03PS2) 10Jcrespo: Prepare for 0.1.7 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/913234 [15:58:18] (03CR) 10Jcrespo: [C: 03+2] Update indexes for latest queries needed for mediabackups [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912818 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [15:58:26] (03CR) 10Jcrespo: [C: 03+2] Add functionality to detect last uploaded time for backup start [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912819 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [15:58:35] (03CR) 10Jcrespo: [C: 03+2] recentuploads: Set custom headers for querying the mediawiki api [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912935 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [15:58:45] (03CR) 10Jcrespo: [C: 03+2] Prepare for 0.1.7 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/913234 (owner: 10Jcrespo) [15:59:47] (03CR) 10Elukey: "Found a missing thing:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885281 (https://phabricator.wikimedia.org/T292818) (owner: 10Giuseppe Lavagetto) [16:00:17] (03CR) 10Elukey: Re-visit scaffolding (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/885281 (https://phabricator.wikimedia.org/T292818) (owner: 10Giuseppe Lavagetto) [16:07:47] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:08:55] (03PS9) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [16:09:25] (03CR) 10CI reject: [V: 04-1] OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 
(https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:10:39] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:14:02] (03CR) 10Herron: [C: 03+1] opensearch: add disable_security_plugin option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [16:16:10] (03PS10) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [16:16:54] (03CR) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:17:23] (03CR) 10Cwhite: [C: 03+2] opensearch: add disable_security_plugin option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [16:26:09] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) 05In progress→03Stalled [16:31:01] (03CR) 10Herron: [C: 03+1] opensearch: add disable_security_plugin option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [16:31:25] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:32:52] (03Abandoned) 10BCornwall: lists: Disable access on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/904854 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [16:34:55] PROBLEM - OSPF status on 
cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:36:47] (03CR) 10Dzahn: [C: 04-1] "alright! Since we have quite the comment history here with fair points still tbd, but to also unblock this. I am going to separate the "ce" [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [16:49:32] (03PS1) 10Jdlrobson: Explicitly enable MFCustomSiteModules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913241 (https://phabricator.wikimedia.org/T270603) [16:52:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [16:53:46] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (10BCornwall) Sent a message to the ops team (message-id: (03PS11) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [17:13:37] (03PS12) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [17:16:08] (03PS13) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [17:19:10] (03CR) 10Dzahn: [C: 03+1] "I remember a long time ago we had an incident with a server rebooted by accident, then we added the pattern that we have to type the host " [cookbooks] - 
10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: 10BCornwall) [17:21:29] (03PS14) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [17:50:49] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@d56b7fb]: (no justification provided) [17:50:59] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@d56b7fb]: (no justification provided) (duration: 00m 10s) [18:00:13] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frbast2002, frauth2002 - https://phabricator.wikimedia.org/T334505 (10Dwisehaupt) [18:00:46] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Dwisehaupt) [18:01:05] (03PS3) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [18:01:29] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (10RLazarus) Thanks for raising the question! I don't think this is the right solution for the problem. The ana... 
[18:01:38] (03CR) 10CI reject: [V: 04-1] [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [18:15:39] (03PS1) 10Ottomata: page_content_change_enrichment - update with latest image and parameterization [deployment-charts] - 10https://gerrit.wikimedia.org/r/913245 (https://phabricator.wikimedia.org/T328478) [18:20:51] (03CR) 10CI reject: [V: 04-1] page_content_change_enrichment - update with latest image and parameterization [deployment-charts] - 10https://gerrit.wikimedia.org/r/913245 (https://phabricator.wikimedia.org/T328478) (owner: 10Ottomata) [18:24:11] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [18:29:05] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:32:58] (03PS2) 10Ottomata: page_content_change_enrichment - update with latest image and parameterization [deployment-charts] - 10https://gerrit.wikimedia.org/r/913245 (https://phabricator.wikimedia.org/T328478) [18:33:08] (03PS3) 10Ottomata: page_content_change_enrichment - update with latest image and parameterization [deployment-charts] - 10https://gerrit.wikimedia.org/r/913245 (https://phabricator.wikimedia.org/T328478) [18:35:44] (03PS1) 10Andrew Bogott: OpenStack envscripts: fix some misnamed variables and remove newlines [puppet] - 10https://gerrit.wikimedia.org/r/913247 [18:36:30] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack envscripts: fix some misnamed variables and remove newlines [puppet] - 10https://gerrit.wikimedia.org/r/913247 (owner: 10Andrew Bogott) [18:43:44] (03PS1) 10Andrew Bogott: envscripts: fix novaobserver domain IDs [puppet] - 10https://gerrit.wikimedia.org/r/913248 
[18:44:53] (03PS1) 10Andrea Denisse: prometheus: Decommission prometheus3001 in esams [puppet] - 10https://gerrit.wikimedia.org/r/913249 (https://phabricator.wikimedia.org/T33558) [18:44:56] (03CR) 10Ottomata: [C: 03+2] page_content_change_enrichment - update with latest image and parameterization [deployment-charts] - 10https://gerrit.wikimedia.org/r/913245 (https://phabricator.wikimedia.org/T328478) (owner: 10Ottomata) [18:46:08] (03CR) 10Andrew Bogott: [C: 03+2] envscripts: fix novaobserver domain IDs [puppet] - 10https://gerrit.wikimedia.org/r/913248 (owner: 10Andrew Bogott) [18:50:04] (03PS1) 10Andrea Denisse: prometheus: Decommission prometheus4001 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/913250 (https://phabricator.wikimedia.org/T335585) [18:51:58] (03Merged) 10jenkins-bot: page_content_change_enrichment - update with latest image and parameterization [deployment-charts] - 10https://gerrit.wikimedia.org/r/913245 (https://phabricator.wikimedia.org/T328478) (owner: 10Ottomata) [18:53:21] (03PS1) 10Andrea Denisse: prometheus: Decommission prometheus5001 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/913251 (https://phabricator.wikimedia.org/T335587) [18:56:29] (03PS4) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [18:56:54] (03CR) 10CI reject: [V: 04-1] [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [18:57:48] (03PS2) 10Andrea Denisse: prometheus: Decommission prometheus5001 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/913251 (https://phabricator.wikimedia.org/T335587) [18:58:57] (03PS1) 10Ottomata: page_content_change_enrichment - apply values currently running in DSE [deployment-charts] - 10https://gerrit.wikimedia.org/r/913252 (https://phabricator.wikimedia.org/T332948) [19:01:03] (03CR) 10Ottomata: [V: 03+2 
C: 03+2] page_content_change_enrichment - apply values currently running in DSE [deployment-charts] - 10https://gerrit.wikimedia.org/r/913252 (https://phabricator.wikimedia.org/T332948) (owner: 10Ottomata) [19:03:09] (03PS1) 10Ottomata: page_content_change_enrichment - fix type [deployment-charts] - 10https://gerrit.wikimedia.org/r/913253 [19:03:40] (03PS2) 10Ottomata: page_content_change_enrichment - fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/913253 [19:03:46] (03CR) 10Ottomata: [V: 03+2 C: 03+2] page_content_change_enrichment - fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/913253 (owner: 10Ottomata) [19:04:45] (03PS1) 10Ottomata: mediawiki_page_content_chnage - Remove duplicate flink conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/913254 [19:04:51] (03PS2) 10Ottomata: mediawiki_page_content_chnage - Remove duplicate flink conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/913254 [19:05:02] (03CR) 10Ottomata: [V: 03+2 C: 03+2] mediawiki_page_content_chnage - Remove duplicate flink conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/913254 (owner: 10Ottomata) [19:07:38] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:07:45] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:08:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.769% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:09:45] (03PS1) 10Ottomata: page_content_change - fix type in --config file path [deployment-charts] - 10https://gerrit.wikimedia.org/r/913255 [19:10:01] (03CR) 10Ottomata: [V: 03+2 C: 03+2] page_content_change - fix type in --config file path [deployment-charts] - 
10https://gerrit.wikimedia.org/r/913255 (owner: 10Ottomata) [19:10:32] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:10:37] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:14:50] (03PS1) 10Andrea Denisse: prometheus: Decommission prometheus6001 in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/913256 (https://phabricator.wikimedia.org/T335588) [19:15:09] (03PS1) 10Ottomata: page_content_change - set correct error sink [deployment-charts] - 10https://gerrit.wikimedia.org/r/913257 [19:15:40] (03CR) 10Ottomata: [V: 03+2 C: 03+2] page_content_change - set correct error sink [deployment-charts] - 10https://gerrit.wikimedia.org/r/913257 (owner: 10Ottomata) [19:16:28] (03CR) 10Andrea Denisse: [C: 03+1] opensearch_dashboards: add package provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [19:16:43] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:16:49] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:18:52] (03PS1) 10Ottomata: page_content_change - set kafka consumer group [deployment-charts] - 10https://gerrit.wikimedia.org/r/913258 [19:19:09] (03CR) 10Ottomata: [V: 03+2 C: 03+2] page_content_change - set kafka consumer group [deployment-charts] - 10https://gerrit.wikimedia.org/r/913258 (owner: 10Ottomata) [19:20:17] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:20:23] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:39:17] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - 
degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:45] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:06:47] (03PS7) 10Dzahn: gerrit: add Prometheus blackbox https monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) [20:21:57] (03CR) 10Dzahn: [C: 03+2] "amended to check for string "Gerrit Code Review", changed to only add monitoring, does not remove anything we already have." [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [20:24:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: setup [20:24:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1001.wikimedia.org with reason: setup [20:24:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1003.wikimedia.org with reason: setup [20:24:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1003.wikimedia.org with reason: setup [20:25:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit2002.wikimedia.org with reason: setup [20:25:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: setup [20:35:33] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:45] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:49:57] (03PS1) 10Dzahn: gerrit: follow_redirects in blackbox::http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/913262 (https://phabricator.wikimedia.org/T329587) [20:52:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [20:52:19] (03CR) 10Dzahn: [C: 03+2] "https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2?_g=h@c823129&_a=h@411c49f" [puppet] - 10https://gerrit.wikimedia.org/r/913262 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [21:00:55] (03PS1) 10Dzahn: gerrit: expect http status 302 from / in blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/913263 [21:01:14] (03CR) 10Dzahn: [C: 03+2] gerrit: expect http status 302 from / in blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/913263 (owner: 10Dzahn) [21:11:51] (03PS1) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [21:12:43] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [21:21:58] (03CR) 10Dzahn: [C: 03+2] "one more followup https://gerrit.wikimedia.org/r/c/operations/puppet/+/913263 but now it works. 
see proof at https://thanos.wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/913262 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [21:26:36] (ProbeDown) firing: (6) Service gerrit1001:443 has failed probes (http_gerrit_tls_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:27:36] (03PS2) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [21:27:46] sigh, I tried too hard to NOT have that alert fire.. that I just added [21:27:55] and checked the dashboards the entire time [21:28:06] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [21:28:10] and it shows success there now after follow-up [21:36:38] (03PS1) 10Dzahn: gerrit: accept http status 404 in blackbox http monitor, for now [puppet] - 10https://gerrit.wikimedia.org/r/913272 (https://phabricator.wikimedia.org/T329587) [21:44:40] (03CR) 10Dzahn: [C: 03+2] gerrit: accept http status 404 in blackbox http monitor, for now [puppet] - 10https://gerrit.wikimedia.org/r/913272 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [21:45:27] (03PS3) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [21:46:10] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [21:46:51] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [21:51:36] (ProbeDown) firing: (4) Service gerrit1001:443 has failed probes 
(http_gerrit_tls_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:52:12] ^ frustrating. I do everything I can to not make that happen.. yet it does [21:52:19] I will just revert soon [21:52:28] nothing is wrong with gerrit [21:52:53] and what we see here somehow doesn't seem to match what I see in thanos and logstash [21:53:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: setup [21:53:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1001.wikimedia.org with reason: setup [21:56:37] (03PS1) 10Dzahn: gerrit: accept 200 in addition to 302 and 404 in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/913273 (https://phabricator.wikimedia.org/T329587) [21:57:05] (03CR) 10Dzahn: [C: 03+2] gerrit: accept 200 in addition to 302 and 404 in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/913273 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [22:07:36] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frbast2002, frauth2002 - https://phabricator.wikimedia.org/T334505 (10Dwisehaupt) @Papaul looks like the netbox entries for the mgmt interfaces for these hosts were missing the DNS names. I have updated it in netbox but don't have the access to be able to ru... [22:08:27] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:14:58] is there someone I can check with that a service is supposed to be publicly accessible? 
[22:16:36] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:18:45] lazyreader39: sure, feel free to PM [22:20:56] thanks, handled! [22:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [22:29:17] (03PS1) 10Dzahn: gerrit: do not monitor the replica [puppet] - 10https://gerrit.wikimedia.org/r/913275 (https://phabricator.wikimedia.org/T329587) [22:31:15] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:31:40] (03CR) 10Dzahn: [C: 03+2] gerrit: do not monitor the replica [puppet] - 10https://gerrit.wikimedia.org/r/913275 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [22:33:11] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for new frack nodes - pt1979@cumin2002" [22:36:34] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:37:13] (03CR) 10Dzahn: [C: 03+2] "finally https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=probe_success%7Binstance%3D~%22.*gerrit.*%22%7D&g0.max_source_resoluti" [puppet] - 10https://gerrit.wikimedia.org/r/913275 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [22:46:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for new frack nodes - pt1979@cumin2002" [22:46:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[22:56:06] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Dwisehaupt) [22:58:43] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Dwisehaupt) Host built and successful puppet runs complete. Still need to migrate the current prometheus data and grafana configs. [23:05:01] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 54528 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [23:08:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.675% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:33:13] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10colewhite) [23:42:56] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Dwisehaupt) Sync of grafana from frmon1001 complete. Sync of prometheus data from frmon2001 complete.