[00:38:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/988243 [00:38:58] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/988243 (owner: 10TrainBranchBot) [00:58:36] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/988243 (owner: 10TrainBranchBot) [01:03:49] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T354580 (10phaultfinder) [01:05:08] RECOVERY - cassandra-b CQL 10.192.48.238:9042 on restbase2035 is OK: TCP OK - 0.031 second response time on 10.192.48.238 port 9042 https://phabricator.wikimedia.org/T93886 [01:11:50] RECOVERY - cassandra-c service on restbase2035 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:12:12] RECOVERY - cassandra-c SSL 10.192.48.239:7000 on restbase2035 is OK: SSL OK - Certificate restbase2035-c valid until 2025-12-07 21:03:47 +0000 (expires in 698 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [01:17:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [01:22:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [01:26:54] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Papaul) 05Open→03Resolved a:03Papaul @Jhancock.wm what I did for the provision cookbook to PASS was to reset the IDRAC password and re-run the cookbook again. @Dzahn the host is back up. 
[01:37:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:09:10] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:39:10] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T0300) [03:07:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.13 [core] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/988244 (https://phabricator.wikimedia.org/T350089) [03:07:32] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.13 [core] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/988244 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [03:09:10] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:17] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [03:10:18] RECOVERY - Dell PowerEdge RAID Controller on dumpsdata1006 is OK: communication: 0 OK 
https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [03:10:44] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [03:10:53] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [03:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:11:12] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [03:11:23] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [03:11:47] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [03:25:20] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.13 [core] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/988244 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [03:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [03:55:16] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:59:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:59:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T0400) [04:01:24] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:07:30] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:08:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:08:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:36:45] (03PS2) 10KartikMistry: testwiki: Enable Section translation on WPs with potential to be supported with MinT using MADLAD-400 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988493 (https://phabricator.wikimedia.org/T353510) [04:37:54] RECOVERY - cassandra-c CQL 10.192.48.239:9042 on restbase2035 is OK: TCP OK - 0.032 second response time on 10.192.48.239 port 9042 https://phabricator.wikimedia.org/T93886 [05:03:48] (03PS6) 10Strainu: [namespaces] Use correct diacritics in Romanian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) [05:19:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:46] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, 
interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:10] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:10:08] (03PS1) 10Marostegui: db2151: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/988760 (https://phabricator.wikimedia.org/T354506) [06:10:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2151 T354506', diff saved to https://phabricator.wikimedia.org/P54552 and previous config saved to /var/cache/conftool/dbconfig/20240109-061015-root.json [06:10:19] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506 [06:11:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2151.codfw.wmnet with OS bookworm [06:11:20] (03CR) 10Marostegui: [C: 03+2] db2151: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/988760 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [06:27:30] 10ops-eqiad, 10DBA, 10DC-Ops: db1224 hardware error - https://phabricator.wikimedia.org/T354591 (10Marostegui) [06:27:45] 10ops-eqiad, 10DBA, 10DC-Ops: db1224 hardware error - https://phabricator.wikimedia.org/T354591 (10Marostegui) p:05Triage→03Medium [06:28:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1224', diff saved to https://phabricator.wikimedia.org/P54553 and previous config saved to /var/cache/conftool/dbconfig/20240109-062806-root.json [06:28:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: host reimage [06:29:06] 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - 
https://phabricator.wikimedia.org/T354591 (10Marostegui) [06:29:42] (03PS1) 10Marostegui: installserver: Do not reimage db1249 [puppet] - 10https://gerrit.wikimedia.org/r/988761 [06:32:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: host reimage [06:33:28] (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage db1249 [puppet] - 10https://gerrit.wikimedia.org/r/988761 (owner: 10Marostegui) [06:35:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 1%: After a crash', diff saved to https://phabricator.wikimedia.org/P54554 and previous config saved to /var/cache/conftool/dbconfig/20240109-063528-root.json [06:36:01] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (10Marostegui) If a reboot/power off is needed, please let us know, as we'd need to depool+stop mariadb. [06:40:36] (03CR) 10Gergő Tisza: beta: Temporarily change default value for 4 Echo properties (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [06:41:15] (03PS1) 10Marostegui: Revert "db2151: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/988689 [06:41:40] (03CR) 10Gergő Tisza: beta: Temporarily change default value for 4 Echo properties (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [06:42:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:42:39] (03CR) 10Gergő Tisza: [C: 03+1] beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) 
[06:42:41] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:42:57] (03CR) 10Gergő Tisza: [C: 03+1] beta: Enable conditional defaults for 4 Echo properties (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [06:48:41] (03CR) 10Marostegui: [C: 03+2] Revert "db2151: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/988689 (owner: 10Marostegui) [06:49:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54555 and previous config saved to /var/cache/conftool/dbconfig/20240109-064916-root.json [06:50:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 5%: After a crash', diff saved to https://phabricator.wikimedia.org/P54556 and previous config saved to /var/cache/conftool/dbconfig/20240109-065033-root.json [06:51:02] (03PS1) 10Marostegui: db2151: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/988762 [06:52:27] (03CR) 10Marostegui: [C: 03+2] db2151: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/988762 (owner: 10Marostegui) [06:54:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2151.codfw.wmnet with OS bookworm [06:58:40] (03PS1) 10RLazarus: Helm chart for k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) [06:58:42] (03PS1) 10RLazarus: admin_ng: Install k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) [06:59:27] (03PS1) 10RLazarus: mediawiki: Support one-off jobs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) [06:59:29] (03CR) 10CI reject: [V: 04-1] Helm chart for k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [06:59:31] (03PS1) 10RLazarus: Add helmfile for running MediaWiki one-off jobs. [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T0700) [07:00:05] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T0700). [07:00:08] (03PS1) 10RLazarus: deployment_server: Add mwscript_k8s [puppet] - 10https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) [07:00:12] (03CR) 10CI reject: [V: 04-1] admin_ng: Install k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [07:01:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2143.codfw.wmnet with OS bookworm [07:01:14] (03CR) 10CI reject: [V: 04-1] Add helmfile for running MediaWiki one-off jobs. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [07:04:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54557 and previous config saved to /var/cache/conftool/dbconfig/20240109-070421-root.json [07:05:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 10%: After a crash', diff saved to https://phabricator.wikimedia.org/P54558 and previous config saved to /var/cache/conftool/dbconfig/20240109-070538-root.json [07:10:16] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:12:42] (03PS2) 10RLazarus: Helm chart for k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) [07:12:44] (03PS2) 10RLazarus: admin_ng: Install k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) [07:13:31] (03CR) 10CI reject: [V: 04-1] admin_ng: Install k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [07:13:36] (03CR) 10CI reject: [V: 04-1] Helm chart for k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [07:15:07] <_joe_> jouncebot: nowandnext [07:15:07] For the next 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T0700) [07:15:07] For the next 0 hour(s) and 14 minute(s): Primary database switchover 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T0700) [07:15:07] In 0 hour(s) and 44 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T0800) [07:19:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54559 and previous config saved to /var/cache/conftool/dbconfig/20240109-071926-root.json [07:20:09] (03CR) 10RLazarus: Add helmfile for running MediaWiki one-off jobs. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [07:20:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: After a crash', diff saved to https://phabricator.wikimedia.org/P54560 and previous config saved to /var/cache/conftool/dbconfig/20240109-072043-root.json [07:27:19] (03CR) 10RLazarus: "I'm open to bikeshedding on both the name of the command and the name/path of the Puppet class -- I'm not especially married to either." 
[puppet] - 10https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [07:31:36] (03CR) 10Stevemunene: [C: 03+1] Disable monitoring on dbstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/987420 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [07:33:15] (03PS3) 10RLazarus: Helm chart for k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) [07:33:17] (03PS3) 10RLazarus: admin_ng: Install k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) [07:33:54] (03CR) 10CI reject: [V: 04-1] Helm chart for k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [07:33:58] (03CR) 10CI reject: [V: 04-1] admin_ng: Install k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [07:34:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54561 and previous config saved to /var/cache/conftool/dbconfig/20240109-073431-root.json [07:35:01] 10ops-codfw, 10DBA: db2143 not rebooting - https://phabricator.wikimedia.org/T354593 (10Marostegui) [07:35:10] 10ops-codfw, 10DBA: db2143 not rebooting - https://phabricator.wikimedia.org/T354593 (10Marostegui) p:05Triage→03Medium [07:35:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: After a crash', diff saved to https://phabricator.wikimedia.org/P54562 and previous config saved to /var/cache/conftool/dbconfig/20240109-073548-root.json [07:38:29] (03PS4) 10RLazarus: Helm chart for k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) 
[07:38:31] (03PS4) 10RLazarus: admin_ng: Install k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) [07:39:29] morning all! I'm seeing NoData on "DispatchChanges Normal job backlog time (p50, 15min) alert" - not sure from the runbook which "the correct datacenter" is. But is anyone looking into this? [07:49:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54563 and previous config saved to /var/cache/conftool/dbconfig/20240109-074936-root.json [07:49:39] <_joe_> codders: not sure what you meant with that message, but wikidata dispatch issues are usually handled by you folks or the data engineering team; And unless there's a task or a page, I doubt anyone is looking into whatever your issue is [07:49:58] <_joe_> "a page" intended as a paging alert [07:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [07:50:31] okay - thanks for the tip :) [07:50:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: After a crash', diff saved to https://phabricator.wikimedia.org/P54564 and previous config saved to /var/cache/conftool/dbconfig/20240109-075053-root.json [07:51:05] <_joe_> codders: what problem are you seeing? it wasn't very clear from your message [07:51:09] just didn't want to spend my morning looking into something if it's a known issue that's being worked on. [07:51:22] on this dashboard I have an alert: https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=5m&from=now-2d&to=now [07:51:58] one of our displays has been receiving no data for the last 15h. 
It seems to be a known issue https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Runbooks/Change_dispatching/Alert#DispatchChanges_normal_job_backlog_time_(p50,_15min) [07:52:18] Just that my colleagues aren't awake yet, and I'm new, so I'm digging around a bit [07:54:09] (03CR) 10Hashar: "The two deployments servers have a different PCC output. deploy2002 has much more changes and adds TCP port 1873." [puppet] - 10https://gerrit.wikimedia.org/r/987779 (owner: 10Muehlenhoff) [07:55:15] <_joe_> so the datacenter part is obsolete [07:55:17] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:55:31] <_joe_> but I'd like to understand why this happened [07:58:11] <_joe_> and it seems not to be limited to just your job [07:59:36] hmm. where do you see that? [08:00:04] Amir1 and Urbanecm: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T0800). [08:00:05] strainu, kart_, and _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:31] * kart_ is here [08:02:04] Should I go ahead with my patch since strainu is not around. [08:02:12] (03CR) 10Muehlenhoff: deployment servers: Switch rsync service to use firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987779 (owner: 10Muehlenhoff) [08:02:59] OK. Going with my patch.. 
[08:04:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54565 and previous config saved to /var/cache/conftool/dbconfig/20240109-080441-root.json [08:04:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988493 (https://phabricator.wikimedia.org/T353510) (owner: 10KartikMistry) [08:05:49] (03Merged) 10jenkins-bot: testwiki: Enable Section translation on WPs with potential to be supported with MinT using MADLAD-400 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988493 (https://phabricator.wikimedia.org/T353510) (owner: 10KartikMistry) [08:05:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: After a crash', diff saved to https://phabricator.wikimedia.org/P54566 and previous config saved to /var/cache/conftool/dbconfig/20240109-080558-root.json [08:06:29] !log kartik@deploy2002 Started scap: Backport for [[gerrit:988493|testwiki: Enable Section translation on WPs with potential to be supported with MinT using MADLAD-400 (T353510)]] [08:06:33] T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400 - https://phabricator.wikimedia.org/T353510 [08:10:35] !log kartik@deploy2002 kartik: Backport for [[gerrit:988493|testwiki: Enable Section translation on WPs with potential to be supported with MinT using MADLAD-400 (T353510)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:11:50] !log kartik@deploy2002 kartik: Continuing with sync [08:15:03] <_joe_> codders: it's an issue with data collection [08:16:34] interesting. can I / we do anything to support there? Or is that someone else's department? [08:17:25] <_joe_> it's someone else's :) [08:17:31] whew! 
:) [08:17:51] <_joe_> one of our main metrics collector, the one for the "right" datacenter, seems unresponsive [08:17:58] <_joe_> or better, it responds with no data :) [08:18:22] that does indeed sound like a problem out of my control [08:18:55] in the process of fixing, I'll update this channel [08:19:03] right is eqiad or codfw in this case? [08:19:13] thanks godog! [08:19:27] sure np codders, codfw is right [08:19:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54567 and previous config saved to /var/cache/conftool/dbconfig/20240109-081946-root.json [08:20:30] !log set aside WAL for prometheus@k8s in codfw and restart - T354399 [08:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:33] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [08:21:06] <_joe_> codders: as for the "right" datacenter, it means the one that's primary for mediawiki [08:21:16] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2143.codfw.wmnet with OS bookworm [08:21:25] got it [08:21:38] <_joe_> we have two, one in texas (codfw), one in virginia (eqiad). 
In the winter mediawiki migrates to the warmer climate of texas [08:21:41] 10SRE, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10pfischer) [08:22:05] <_joe_> so from the autumn equinox to the spring one, primary is codfw [08:22:10] <_joe_> for the rest of the year, it's eqiad [08:22:23] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:988493|testwiki: Enable Section translation on WPs with potential to be supported with MinT using MADLAD-400 (T353510)]] (duration: 15m 54s) [08:22:27] T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400 - https://phabricator.wikimedia.org/T353510 [08:24:16] that seems surprisingly astrological :) [08:24:38] metrics collection is back for k8s codfw, unfortunately the last ~12h are not available tho [08:24:57] great. thanks for the support! [08:25:30] you are welcome codders ! [08:25:35] (03PS2) 10Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) [08:26:28] Sorry, I forgot to ping. _joe_ you can go ahead with config deployment [08:30:49] (03CR) 10Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:31:51] (03CR) 10Hashar: [C: 03+1] "Ah great, thanks for the explanation!" [puppet] - 10https://gerrit.wikimedia.org/r/987779 (owner: 10Muehlenhoff) [08:35:07] <_joe_> urbanecm, Amir1 you around? [08:35:13] <_joe_> else I'll continue [08:35:38] (03CR) 10Hashar: [C: 03+1] "Excellent, and thank you for the extensive and well detailed inline comment!" 
[puppet] - 10https://gerrit.wikimedia.org/r/987961 (https://phabricator.wikimedia.org/T335354) (owner: 10Muehlenhoff) [08:35:42] (03CR) 10Slyngshede: "Move debmonitor to a subdirectory to make it easier to package." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799 (owner: 10Slyngshede) [08:36:43] _joe_: around IRC-wise if I'm needed, at my keyboard in five. But feel free to continue either way :) [08:36:57] <_joe_> urbanecm: ack [08:37:48] (03PS1) 10KartikMistry: testwiki: Enable Section translation on WPs with Content Translation available as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988984 (https://phabricator.wikimedia.org/T351882) [08:39:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987031 (https://phabricator.wikimedia.org/T352569) (owner: 10Giuseppe Lavagetto) [08:40:24] (03Merged) 10jenkins-bot: Remove throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987031 (https://phabricator.wikimedia.org/T352569) (owner: 10Giuseppe Lavagetto) [08:40:28] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10ArthurTaylor) @Aklapper Could this be why I have trouble logging into wikitech? Is there anything I can / need to do here? 
[08:40:35] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:987031|Remove throttle exception (T352569)]] [08:40:39] T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T352569 [08:40:48] (03CR) 10Ayounsi: k8s topology labels: add row to rack transition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [08:42:18] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:987031|Remove throttle exception (T352569)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:42:54] !log oblivian@deploy2002 oblivian: Continuing with sync [08:43:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, one suggestion inline." [puppet] - 10https://gerrit.wikimedia.org/r/761397 (https://phabricator.wikimedia.org/T296533) (owner: 10Majavah) [08:44:30] (03Abandoned) 10Majavah: P:ceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/801380 (owner: 10Majavah) [08:45:41] (03PS1) 10Muehlenhoff: Remove obsolete Hiera files [puppet] - 10https://gerrit.wikimedia.org/r/989086 (https://phabricator.wikimedia.org/T296533) [08:45:50] (03PS2) 10Muehlenhoff: Remove obsolete Hiera files [puppet] - 10https://gerrit.wikimedia.org/r/989086 (https://phabricator.wikimedia.org/T296533) [08:47:03] (03PS2) 10Stevemunene: druid: remove druid100[4-6] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/974120 (https://phabricator.wikimedia.org/T336043) [08:47:11] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 9902 [08:47:51] (03PS2) 10Urbanecm: beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) [08:48:12] (03CR) 10CI reject: [V: 04-1] druid: remove druid100[4-6] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/974120 
(https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [08:48:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 9902 [08:49:36] (03PS3) 10Stevemunene: druid: remove druid100[4-6] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/974120 (https://phabricator.wikimedia.org/T336043) [08:49:37] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:987031|Remove throttle exception (T352569)]] (duration: 09m 01s) [08:49:41] T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T352569 [08:50:08] (03CR) 10Urbanecm: [C: 04-2] beta: Enable conditional defaults for 4 Echo properties (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [08:50:10] (03PS2) 10Urbanecm: beta: Enable conditional defaults for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) [08:50:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:50:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:53:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:53:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:54:05] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp [08:54:12] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 
'configure' for AS: 45287 [08:54:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [08:54:38] (03CR) 10CI reject: [V: 04-1] Use shellbox for djvu handling on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [08:54:57] (03PS1) 10Majavah: conftool-data: Remove wiki replica dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/989087 (https://phabricator.wikimedia.org/T346947) [08:58:12] (03PS2) 10Majavah: Move dbproxy1018/9 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/988681 (https://phabricator.wikimedia.org/T346947) [08:58:14] (03PS1) 10Majavah: mariadb: remove grants and firewall rules for dbproxy1018/9 [puppet] - 10https://gerrit.wikimedia.org/r/989088 (https://phabricator.wikimedia.org/T346947) [08:59:14] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 45287 [08:59:31] (03PS2) 10Giuseppe Lavagetto: Use shellbox for djvu handling on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) [08:59:33] (03PS2) 10Giuseppe Lavagetto: Always process media files via shellbox on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987033 (https://phabricator.wikimedia.org/T352515) [08:59:35] (03PS2) 10Giuseppe Lavagetto: Explicitly disable all local imagescaling on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987432 (https://phabricator.wikimedia.org/T352515) [09:01:32] (03CR) 10Muehlenhoff: [C: 03+2] ncredir: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/984836 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [09:02:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by 
oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [09:02:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] [vrts] Adjust restart and oom policy for clamav and vrts services [puppet] - 10https://gerrit.wikimedia.org/r/988739 (https://phabricator.wikimedia.org/T354478) (owner: 10EoghanGaffney) [09:03:12] (03Merged) 10jenkins-bot: Use shellbox for djvu handling on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [09:03:25] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:987032|Use shellbox for djvu handling on kubernetes (T352515)]] [09:03:30] T352515: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 [09:05:17] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:987032|Use shellbox for djvu handling on kubernetes (T352515)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:06:13] !log upload wmfdb 0.1.4 from https://gitlab.wikimedia.org/repos/sre/wmfdb/-/tree/dgit/bookworm-wikimedia to fix default ca bundle [09:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:39] (03CR) 10Urbanecm: beta: Temporarily change default value for 4 Echo properties (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [09:11:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp [09:14:00] !log prune obsolete nginx packages from ncredir hosts after migration to new library scheme T329529 [09:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:04] T329529: Adapt profile::nginx to new packaging scheme introduced in Bookworm - 
https://phabricator.wikimedia.org/T329529 [09:15:25] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and A:cp [09:17:21] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 (10MoritzMuehlenhoff) [09:20:46] !log oblivian@deploy2002 oblivian: Continuing with sync [09:20:58] (03CR) 10Volans: "One comment inline, in general I'm ok with the idea. Have you tested that this is still working fine and all internal imports will work wi" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799 (owner: 10Slyngshede) [09:21:10] (03PS1) 10Muehlenhoff: eventschemas: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/989090 (https://phabricator.wikimedia.org/T329529) [09:24:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989090 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [09:27:22] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:987032|Use shellbox for djvu handling on kubernetes (T352515)]] (duration: 23m 56s) [09:27:26] T352515: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 [09:32:19] (03PS1) 10Giuseppe Lavagetto: mw-debug: sync settings with main cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/989092 [09:34:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10Aklapper) >>! In T354049#9444602, @ArthurTaylor wrote: > @Aklapper Could this by why I have trouble logging into wikitech? If you have "trouble logging into wi... 
[09:34:39] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and A:cp [09:37:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10ArthurTaylor) Re how this account was created, I don't know enough about account creation to answer that. I just opened this ticket, and otherwise I was on-boar... [09:39:12] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp [09:39:22] (03CR) 10JMeybohm: [C: 03+1] mw-debug: sync settings with main cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/989092 (owner: 10Giuseppe Lavagetto) [09:40:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-debug: sync settings with main cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/989092 (owner: 10Giuseppe Lavagetto) [09:40:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987033 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [09:40:57] (03Merged) 10jenkins-bot: mw-debug: sync settings with main cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/989092 (owner: 10Giuseppe Lavagetto) [09:42:46] (03Merged) 10jenkins-bot: Always process media files via shellbox on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987033 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [09:42:59] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:987033|Always process media files via shellbox on k8s (T352515)]] [09:43:03] T352515: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 [09:44:41] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:987033|Always process media files via shellbox on k8s (T352515)]] synced to the 
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:46:41] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [09:46:51] (03PS2) 10D3r1ck01: wmf-config: Remove unused wgCentralAuthTokenCacheType [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988403 (https://phabricator.wikimedia.org/T336004) [09:47:23] !log oblivian@deploy2002 oblivian: Continuing with sync [09:47:58] (03PS19) 10Stevemunene: C:bigtop::hadoop switch to new topology script. [puppet] - 10https://gerrit.wikimedia.org/r/954911 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:48:52] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2033/2034 move - ayounsi@cumin1002" [09:52:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2033/2034 move - ayounsi@cumin1002" [09:52:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:54:02] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:987033|Always process media files via shellbox on k8s (T352515)]] (duration: 11m 03s) [09:54:06] T352515: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 [09:54:52] (03CR) 10Muehlenhoff: [C: 03+2] deployment servers: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987779 (owner: 10Muehlenhoff) [09:59:02] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti2033.codfw.wmnet with OS bookworm [10:00:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp [10:03:28] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - 
https://phabricator.wikimedia.org/T330150 (10Volans) 05Open→03Resolved p:05Triage→03Medium a:03Volans As this ticket is few months old... [10:05:49] 10SRE-tools, 10Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (10Volans) Is this request still current or given John's explanation could it be closed as declined? [10:09:41] (03PS1) 10JMeybohm: prometheus::k8s: Drop id label from cadvisor metrics [puppet] - 10https://gerrit.wikimedia.org/r/989096 (https://phabricator.wikimedia.org/T354604) [10:09:53] (03PS1) 10Slyngshede: Ganeti memory preassure alerting. [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) [10:11:29] !log btullis@cumin1002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [10:12:00] (03CR) 10Btullis: [C: 03+2] Disable monitoring on dbstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/987420 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [10:15:27] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/984163 (owner: 10Muehlenhoff) [10:16:12] (03PS1) 10Santiago Faci: Deploying to staging to test the fix with production data [deployment-charts] - 10https://gerrit.wikimedia.org/r/989098 (https://phabricator.wikimedia.org/T354074) [10:19:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [10:19:18] (03CR) 10Santiago Faci: [V: 03+2 C: 03+2] "Self merging to deploy to staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/989098 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [10:19:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/989096 (https://phabricator.wikimedia.org/T354604) (owner: 10JMeybohm) [10:19:55] Hey folks, I noticed my change was skipped since I was not around. I cannot guarantee I can join during a specific deploy window, is that a hard requirement? [10:20:22] (03Merged) 10jenkins-bot: Deploying to staging to test the fix with production data [deployment-charts] - 10https://gerrit.wikimedia.org/r/989098 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [10:20:28] 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10Volans) @fgiunchedi I noticed the pontoon name in the logs, so I guess you're running it in an environment where debmonitor is not present. So instead o... [10:20:49] (03CR) 10David Caro: grid: disable hardcoded memory overcmommit on weblight (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983139 (owner: 10David Caro) [10:21:02] (03PS1) 10Btullis: Switch staging-db-analytics.eqiad.wmnet to dbstore1009 [dns] - 10https://gerrit.wikimedia.org/r/989099 (https://phabricator.wikimedia.org/T351924) [10:21:04] (03PS1) 10Btullis: Switch s8-analytics-replica.eqiad.wmnet to dbstore1009 [dns] - 10https://gerrit.wikimedia.org/r/989100 (https://phabricator.wikimedia.org/T351924) [10:21:06] (03PS1) 10Btullis: Switch s6-analytics-replica.eqiad.wmnet to dbstore1009 [dns] - 10https://gerrit.wikimedia.org/r/989101 (https://phabricator.wikimedia.org/T351924) [10:21:08] (03PS1) 10Btullis: Switch x1-analytics-replica.eqiad.wmnet to dbstore1009 [dns] - 10https://gerrit.wikimedia.org/r/989102 (https://phabricator.wikimedia.org/T351924) [10:21:45] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1050/co" [puppet] - 10https://gerrit.wikimedia.org/r/989096 (https://phabricator.wikimedia.org/T354604) 
(owner: 10JMeybohm) [10:21:50] !log sfaci@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [10:22:06] !log sfaci@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [10:25:04] 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10fgiunchedi) [10:26:51] (03PS1) 10Muehlenhoff: rsync::quickdatacopy: Remove use_generic_firewall and auto_ferm_ipv6 flags [puppet] - 10https://gerrit.wikimedia.org/r/989103 [10:31:00] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus::k8s: Drop id label from cadvisor metrics [puppet] - 10https://gerrit.wikimedia.org/r/989096 (https://phabricator.wikimedia.org/T354604) (owner: 10JMeybohm) [10:35:46] (03PS1) 10Btullis: Enable monitoring for dbstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/989104 (https://phabricator.wikimedia.org/T351924) [10:35:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989103 (owner: 10Muehlenhoff) [10:35:50] (03PS1) 10Btullis: Disable monitoring for dbstore1005 [puppet] - 10https://gerrit.wikimedia.org/r/989105 (https://phabricator.wikimedia.org/T351924) [10:38:32] !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [10:39:54] (03CR) 10Marostegui: [C: 03+1] Disable monitoring for dbstore1005 [puppet] - 10https://gerrit.wikimedia.org/r/989105 (https://phabricator.wikimedia.org/T351924) (owner: 10Btullis) [10:43:34] (03CR) 10FNegri: [C: 03+1] "LGTM, but please update https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Depool_wikireplicas#LVS_servers (and proba" [puppet] - 10https://gerrit.wikimedia.org/r/989087 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [10:44:15] (03CR) 10Muehlenhoff: [C: 03+2] piwik: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/984163 (owner: 
10Muehlenhoff) [10:44:42] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (10Volans) [10:45:27] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ganeti2033.codfw.wmnet with OS bookworm [10:45:59] (03CR) 10FNegri: [C: 03+1] "Leaving a +1 in case you want to test it, otherwise you can abandon this patch after the grid is shut down." [puppet] - 10https://gerrit.wikimedia.org/r/983139 (owner: 10David Caro) [10:47:27] (03CR) 10Majavah: [C: 03+2] conftool-data: Remove wiki replica dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/989087 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [10:50:28] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 (10MoritzMuehlenhoff) [10:54:20] !log restart prometheus@k8s on prometheus1005 to see if labeldrop id will yield expected results - T354604 [10:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:24] T354604: Investigate prometheus@k8s metric/label cardinality reduction - https://phabricator.wikimedia.org/T354604 [10:56:13] (03CR) 10Btullis: [C: 03+2] Disable monitoring for dbstore1005 [puppet] - 10https://gerrit.wikimedia.org/r/989105 (https://phabricator.wikimedia.org/T351924) (owner: 10Btullis) [10:56:23] PROBLEM - memcached socket on mw2394 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T1100) [11:03:50] (03CR) 10Majavah: [C: 03+1] "few nits, otherwise looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [11:05:51] !log installing exim 
security updates [11:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:21] (03PS1) 10Hnowlan: jobqueue: restore media handling jobs to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/989128 (https://phabricator.wikimedia.org/T352515) [11:08:17] (03PS1) 10Lucas Werkmeister (WMDE): Add debug code for entity usage logic issue [extensions/Wikibase] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/989125 (https://phabricator.wikimedia.org/T255706) [11:10:07] I’m going to deploy ^ this debug code on mwdebug2001 for a moment; please avoid syncing wmf.12 from deploy2002 until I’m done ^^ [11:10:15] (03CR) 10Clément Goubert: [C: 03+1] jobqueue: restore media handling jobs to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/989128 (https://phabricator.wikimedia.org/T352515) (owner: 10Hnowlan) [11:10:16] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:29] (if an urgent deployment is necessary, it should be okay-ish, there’ll just be some logspam at warning severity) [11:10:35] jouncebot: now [11:10:35] For the next 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T1100) [11:10:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] jobqueue: restore media handling jobs to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/989128 (https://phabricator.wikimedia.org/T352515) (owner: 10Hnowlan) [11:12:16] (03Abandoned) 10Lucas Werkmeister (WMDE): Add debug code for entity usage logic issue [extensions/Wikibase] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/984848 (https://phabricator.wikimedia.org/T255706) (owner: 10Lucas Werkmeister (WMDE)) [11:13:00] (03PS6) 10FNegri: dologmsg: standardize 
logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [11:13:06] pulled to mwdebug2001 [11:13:29] (03CR) 10FNegri: dologmsg: standardize logging format (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [11:13:49] and original wmf.12 code restored again [11:14:00] I’ll look at what that did to logstash [11:14:06] (03CR) 10Hnowlan: [C: 03+2] jobqueue: restore media handling jobs to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/989128 (https://phabricator.wikimedia.org/T352515) (owner: 10Hnowlan) [11:14:15] deployments unblocked as far as I’m concerned, though I might come back and test some more debug code, who knows [11:14:40] (03CR) 10Btullis: [C: 03+1] "This looks good. Perhaps we should stagger the deploy by disabling puppet and depooling a host before running puppet to update it, then po" [puppet] - 10https://gerrit.wikimedia.org/r/989090 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [11:14:52] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [11:14:59] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3 [11:15:15] (03Merged) 10jenkins-bot: jobqueue: restore media handling jobs to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/989128 (https://phabricator.wikimedia.org/T352515) (owner: 10Hnowlan) [11:15:23] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3 [11:16:07] lol, I forgot to actually enable WikimediaDebug [11:16:11] so I’ll just do the same thing again [11:16:29] scap pulled debug code to mwdebug2001 again [11:17:03] (03PS7) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [11:17:03] and wmf.12 restored 
again [11:17:08] and scap pulled on mwdebug2001 again [11:17:34] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10MatthewVernon) Hi. I'm the clinician on duty this week. I'm afraid I'm not quite clear what sort of access you are requesting here (ml-staging-codfw is... [11:17:41] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:17:57] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:18:15] (03PS8) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [11:18:17] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:18:43] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:19:04] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:19:35] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:19:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1014.eqiad.wmnet with OS bullseye [11:22:38] !log cgoubert@cumin2002 conftool action : set/pooled=no; selector: name=mw2394.codfw.wmnet [11:25:43] RECOVERY - memcached socket on mw2394 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [11:29:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1180 T354506', diff saved to https://phabricator.wikimedia.org/P54568 and previous config saved to /var/cache/conftool/dbconfig/20240109-112922-root.json [11:29:26] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506 [11:30:06] (03PS1) 10Marostegui: db1180: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989129 
(https://phabricator.wikimedia.org/T354506) [11:30:08] !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=mw2394.codfw.wmnet,cluster=jobrunner [11:30:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1180.eqiad.wmnet with OS bookworm [11:31:17] (03CR) 10Marostegui: [C: 03+2] db1180: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/989129 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [11:32:21] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1014.eqiad.wmnet with reason: host reimage [11:32:41] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Clement_Goubert) mw2394 squared up and repooled, set back in active in Netbox [11:35:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1014.eqiad.wmnet with reason: host reimage [11:36:49] (03PS2) 10Lucas Werkmeister (WMDE): Add debug code for entity usage logic issue [extensions/Wikibase] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/989125 (https://phabricator.wikimedia.org/T255706) [11:37:03] ^ going to test PS2 on mwdebug2001 again [11:37:08] (03CR) 10David Caro: [C: 03+1] "LGTM, might want to migrate it to an `epp` template (better parameter/variable handling, static type checking, ...)" [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [11:37:10] (ping me if I should stop or anything) [11:37:42] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on lsw1-b8-codfw,lsw1-b8-codfw IPv6 with reason: Adding vlan to switch, precaution in case it triggers EVPN L3 bug. 
[11:37:48] !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [11:37:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lsw1-b8-codfw,lsw1-b8-codfw IPv6 with reason: Adding vlan to switch, precaution in case it triggers EVPN L3 bug. [11:38:05] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c70b0979-84e8-4fe7-8682-45d50615a587) set by cmooney@cumin1002 f... [11:38:29] scap pulled to mwdebug2001 [11:38:49] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [11:39:00] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:39:16] and restored wmf.12 again [11:39:52] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift [11:40:20] (03CR) 10Stevemunene: [C: 03+2] C:bigtop::hadoop switch to new topology script. [puppet] - 10https://gerrit.wikimedia.org/r/954911 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [11:41:28] (03CR) 10Btullis: [C: 03+2] Switch staging-db-analytics.eqiad.wmnet to dbstore1009 [dns] - 10https://gerrit.wikimedia.org/r/989099 (https://phabricator.wikimedia.org/T351924) (owner: 10Btullis) [11:42:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1180.eqiad.wmnet with reason: host reimage [11:43:31] !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[11:44:35] (03CR) 10MVernon: [C: 04-1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/988728 (https://phabricator.wikimedia.org/T354227) (owner: 10Eevans) [11:46:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1180.eqiad.wmnet with reason: host reimage [11:46:29] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:47:03] (03CR) 10Volans: [C: 04-1] "Some minor issues/comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb) [11:49:19] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update dns entry for kubestage2002.codfw.wmnet - cmooney@cumin1002" [11:50:15] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp [11:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [11:50:29] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. 
[11:50:32] (03CR) 10Btullis: [C: 03+2] Enable monitoring for dbstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/989104 (https://phabricator.wikimedia.org/T351924) (owner: 10Btullis) [11:50:49] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update dns entry for kubestage2002.codfw.wmnet - cmooney@cumin1002" [11:50:49] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:51:42] (03PS1) 10Majavah: wikireplicas: update-views: always run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/989130 (https://phabricator.wikimedia.org/T297026) [11:53:01] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:54:05] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift [11:55:17] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:30:06] !log samtar@deploy2002 samtar and anzx: Backport for [[gerrit:968318|Create draft namespace and add namespaces aliases for hewikinews (T349581)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:30:09] TheresNoTime: checking [14:30:16] ack [14:30:33] (03PS5) 10Samtar: zghwiki: add metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986659 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [14:31:58] TheresNoTime: looks good [14:32:03] !log samtar@deploy2002 samtar and anzx: Continuing with sync [14:32:48] (03PS4) 10Samtar: bjnwikiquote: add metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986660 (https://phabricator.wikimedia.org/T350235) (owner: 10Anzx) 
[14:33:51] might try combining the other two patches' deploy.. [14:34:00] TheresNoTime: ok [14:34:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] k8s topology labels: add row to rack transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [14:34:19] (03CR) 10Muehlenhoff: "Does this only list groups below ou=groups or all groups? The latter would also list the various OpenStack-internal ones which have no rea" [puppet] - 10https://gerrit.wikimedia.org/r/987120 (https://phabricator.wikimedia.org/T354069) (owner: 10Hashar) [14:34:39] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host snapshot1014.eqiad.wmnet [14:34:47] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Clement_Goubert) ` # kubectl describe nodes kubestage2002.codfw.wmnet | grep -A3 Addresses Addresses: InternalIP: 10.192.22.13... [14:35:00] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10akosiaris) >>! In T352883#9445907, @ayounsi wrote: >> This is the thing we need to get fixed, I see > Yep, that's {T352893} and i... [14:35:24] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:36:01] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[14:38:22] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:968318|Create draft namespace and add namespaces aliases for hewikinews (T349581)]] (duration: 10m 05s) [14:38:25] TheresNoTime: can you run namespacedupes.php for hewikinews [14:38:36] T349581: Create draft namespace and add namespaces aliases for hewikinews - https://phabricator.wikimedia.org/T349581 [14:38:49] !log `[samtar@mwmaint2002 ~]$ mwscript namespaceDupes.php --wiki hewikinews --fix` T349581 [14:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:53] anzx: maybe unexpected result [14:40:00] https://phabricator.wikimedia.org/T349581#9446057 [14:40:08] `id=9133 ns=0 dbk=תב:עדכונים *** dest title exists and --add-prefix not specified` [14:41:31] TheresNoTime: re-run it with `--add-prefix "BROKEN "`, then tell the requestor in the task to deal with the now-useless page [14:41:39] taavi: ack, thank you [14:41:50] (03PS1) 10Kamila Součková: TEMPORARY for debugging T354413: add role hiera [puppet] - 10https://gerrit.wikimedia.org/r/989192 (https://phabricator.wikimedia.org/T354413) [14:42:07] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - ayounsi@cumin1002" [14:42:41] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) >>! In T352974#9446004, @MoritzMuehlenhoff wrote: >>>! In T352974#9445831, @ABran-WMF wrote: >> @MoritzMuehlenhoff I'm currently tryi... 
[14:42:54] anzx: see https://phabricator.wikimedia.org/T349581#9446070, one page you'll need to deal with :) moving on to the other patches [14:43:15] Ok, I will let local admin know [14:43:22] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279) (owner: 10Muehlenhoff) [14:43:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986659 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [14:43:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986660 (https://phabricator.wikimedia.org/T350235) (owner: 10Anzx) [14:43:44] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp [14:43:46] (03CR) 10Herron: [C: 03+1] udp2log: amend demux.py to support the python3 runtime [puppet] - 10https://gerrit.wikimedia.org/r/984237 (https://phabricator.wikimedia.org/T353220) (owner: 10Cwhite) [14:43:47] that one page is https://he.wikinews.org/w/index.php?title=%D7%AA%D7%91%D7%A0%D7%99%D7%AA:BROKEN_%D7%A2%D7%93%D7%9B%D7%95%D7%A0%D7%99%D7%9D&redirect=no, and likely can just be deleted [14:43:55] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:11] (03Merged) 10jenkins-bot: zghwiki: add metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986659 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [14:44:40] (03PS5) 10Samtar: bjnwikiquote: add metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986660 (https://phabricator.wikimedia.org/T350235) (owner: 10Anzx) [14:44:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - ayounsi@cumin1002" 
[14:45:00] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2034.codfw.wmnet with OS bookworm [14:45:28] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10BTullis) I see these entries in the logs from orchestrator. ` Jan 09 14:43:15 dborch1001 orchestrator[3587041]: 2024-01-09 14:43:15 WARNING Disc... [14:45:44] (03CR) 10TrainBranchBot: "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986660 (https://phabricator.wikimedia.org/T350235) (owner: 10Anzx) [14:46:27] (03Merged) 10jenkins-bot: bjnwikiquote: add metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986660 (https://phabricator.wikimedia.org/T350235) (owner: 10Anzx) [14:46:43] !log samtar@deploy2002 Started scap: Backport for [[gerrit:986659|zghwiki: add metanamespace (T350241)]], [[gerrit:986660|bjnwikiquote: add metanamespace (T350235)]] [14:46:57] T350241: Post-creation work for zghwiki - https://phabricator.wikimedia.org/T350241 [14:46:57] T350235: Post-creation work for bjnwikiquote - https://phabricator.wikimedia.org/T350235 [14:48:00] (03PS10) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [14:48:12] (KubernetesCalicoDown) firing: mw1378.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1378.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:48:26] (03CR) 10Hnowlan: [C: 03+1] wikifeeds: Fix values broken yaml structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/989141 (https://phabricator.wikimedia.org/T347027) (owner: 10Jgiannelos) [14:48:33] I'm going to get a failure on `mw1378` aren't I.. 
[14:48:46] (03CR) 10FNegri: dologmsg: standardize logging format (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [14:49:08] (03CR) 10CI reject: [V: 04-1] dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [14:49:13] * TheresNoTime waits for it to timeout [14:50:02] yup [14:50:34] !log samtar@deploy2002 samtar and anzx: Backport for [[gerrit:986659|zghwiki: add metanamespace (T350241)]], [[gerrit:986660|bjnwikiquote: add metanamespace (T350235)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:50:36] anzx: both of those are ready for testing :) [14:50:37] TheresNoTime: checking [14:50:39] (03PS11) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [14:50:55] (03CR) 10Stevemunene: druid: remove druid100[4-6] from druid_public_broker VIP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974120 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [14:51:01] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: Fix values broken yaml structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/989141 (https://phabricator.wikimedia.org/T347027) (owner: 10Jgiannelos) [14:52:10] (03Merged) 10jenkins-bot: wikifeeds: Fix values broken yaml structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/989141 (https://phabricator.wikimedia.org/T347027) (owner: 10Jgiannelos) [14:52:38] TheresNoTime: looks good [14:52:43] !log samtar@deploy2002 samtar and anzx: Continuing with sync [14:53:07] (03CR) 10Ladsgroup: production.sql.erb: Add cumin1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989144 (https://phabricator.wikimedia.org/T352974) (owner: 10Marostegui) [14:53:50] (03CR) 10Marostegui: [C: 03+2] production.sql.erb: Add cumin1002 (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/989144 (https://phabricator.wikimedia.org/T352974) (owner: 10Marostegui) [14:56:02] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:56:16] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10MoritzMuehlenhoff) >>! In T352974#9446072, @ABran-WMF wrote: >>>! In T352974#9446004, @MoritzMuehlenhoff wrote: >>>>! In T352974#9445831, @ABran... [14:56:35] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [14:58:54] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:986659|zghwiki: add metanamespace (T350241)]], [[gerrit:986660|bjnwikiquote: add metanamespace (T350235)]] (duration: 12m 10s) [14:58:56] running scripts [14:59:01] T350241: Post-creation work for zghwiki - https://phabricator.wikimedia.org/T350241 [14:59:02] T350235: Post-creation work for bjnwikiquote - https://phabricator.wikimedia.org/T350235 [14:59:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:20] !log `[samtar@mwmaint2002 ~]$ mwscript namespaceDupes.php --wiki zghwiki --add-prefix "BROKEN " --fix` T350241 [14:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:45] TheresNoTime: can you run `echo 'https://en.wikipedia.org/static/images/mobile/copyright/wikinews-wordmark-zh.svg' | mwscript purgeList.php ` for T353792 it seems cache for wordmark was not cleared [14:59:46] T353792: Logo of zh WikiNews has background color instead of alpha channel (visible in Minerva) - https://phabricator.wikimedia.org/T353792 [15:00:05] !log `[samtar@mwmaint2002 ~]$ mwscript namespaceDupes.php --wiki 
bjnwikiquote --add-prefix "BROKEN " --fix` T350235 [15:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:53] anzx: ack [15:01:07] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [15:01:16] !log `[samtar@mwmaint2002 ~]$ echo 'https://en.wikipedia.org/static/images/mobile/copyright/wikinews-wordmark-zh.svg' | mwscript purgeList.php` T353792 [15:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:26] (done) [15:01:34] TheresNoTime: thank you [15:02:01] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [15:02:03] anzx: could you check bjnwikiquote again please? I got an error running namespaceDupes (but it seems to have worked anyway..) [15:02:13] checking [15:02:33] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [15:03:08] (03CR) 10Ssingh: [V: 03+2 C: 03+2] druid: remove druid100[4-6] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/974120 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [15:03:13] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [15:03:26] https://www.irccloud.com/pastebin/qRM3SpUz/ [15:04:45] (03CR) 10Cathal Mooney: k8s topology labels: add row to rack transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [15:05:48] TheresNoTime: page seems to in moved in project namespace correctly [15:06:37] ack, okay :) [15:06:48] !log done UTC afternoon backport window [15:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:15] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/988739/1052/vrts1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/988739 (https://phabricator.wikimedia.org/T354478) (owner: 10EoghanGaffney) [15:10:45] 
RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:11:12] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: move cirrusSearchElasticaWrite [deployment-charts] - 10https://gerrit.wikimedia.org/r/989133 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:12:17] (03Merged) 10jenkins-bot: changeprop-jobqueue: move cirrusSearchElasticaWrite [deployment-charts] - 10https://gerrit.wikimedia.org/r/989133 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:12:53] T354652, creeping up steadily.. [15:12:54] T354652: JobQueueError: Could not enqueue jobs - https://phabricator.wikimedia.org/T354652 [15:13:12] (KubernetesCalicoDown) resolved: mw1378.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1378.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:14:23] !log restart pybal on lvs1020: T336043 [15:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:32] T336043: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 [15:15:37] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:16:00] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:16:03] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:21] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:16:43] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:17:17] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [15:19:10] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [15:19:12] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team: Requesting write 
access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10calbon) a:03klausman [15:19:42] !log restart pybal on lvs1019: T336043 [15:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:46] T336043: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 [15:25:06] 10SRE, 10Machine-Learning-Team: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10klausman) I think we can remove the SRE-Access-Requests tag, since this likely can be entirely covered on the k8s permission level. [15:25:25] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:12] (KubernetesCalicoDown) firing: mw1378.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1378.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:33:37] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [15:35:11] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (10BTullis) Just for reference, I think that we are still undecided on whether this roll-back is necessary, or whether we will be able to //fix... 
[15:36:00] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (10BTullis) 05Open→03Stalled p:05Triage→03High [15:36:08] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10BTullis) [15:36:14] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [15:37:39] !log depool and reboot mw1349 for a test T354413 [15:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:42] T354413: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 [15:39:12] (KubernetesCalicoDown) resolved: mw1378.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1378.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:39:25] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:31] (03PS1) 10Majavah: wikimediacloud.org: point rabbitmq03.eqiad1 to cloudrabbit1002 [dns] - 10https://gerrit.wikimedia.org/r/989196 (https://phabricator.wikimedia.org/T345610) [15:44:19] 10SRE, 10Machine-Learning-Team: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10isarantopoulos) [15:44:38] (03PS2) 10Dzahn: phabricator: move data syncing related code to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/988107 (https://phabricator.wikimedia.org/T354221) [15:44:39] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10BTullis) In case it helps, this is also a useful command for showing the 
certificate chain that is presented by the dbstore servers. ` btullis@d... [15:45:30] (03CR) 10Andrew Bogott: [C: 03+1] "I have learned to make no predictions about how rabbitmq client will behave, but let's see if this works!" [dns] - 10https://gerrit.wikimedia.org/r/989196 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [15:45:58] (03CR) 10Majavah: [C: 03+2] wikimediacloud.org: point rabbitmq03.eqiad1 to cloudrabbit1002 [dns] - 10https://gerrit.wikimedia.org/r/989196 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [15:46:37] (03CR) 10Dzahn: [C: 03+2] phabricator: quarterly_metrics.sh: Correct Year output [puppet] - 10https://gerrit.wikimedia.org/r/987784 (owner: 10Aklapper) [15:48:46] (03CR) 10Dzahn: [C: 03+2] phabricator: Yearly metrics for wikitech-l: Correct strings [puppet] - 10https://gerrit.wikimedia.org/r/987140 (owner: 10Aklapper) [15:49:23] (03PS1) 10Majavah: Revert "P:toolforge::mailrelay: rewrite maintainers in Python" [puppet] - 10https://gerrit.wikimedia.org/r/989154 [15:49:28] (03CR) 10Majavah: [V: 03+2 C: 03+2] Revert "P:toolforge::mailrelay: rewrite maintainers in Python" [puppet] - 10https://gerrit.wikimedia.org/r/989154 (owner: 10Majavah) [15:50:12] (03PS12) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [15:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [15:54:33] !log restart prometheus@k8s on prometheus1005 with GOGC=60 - T354604 [15:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:37] T354604: Investigate prometheus@k8s metric/label cardinality reduction - https://phabricator.wikimedia.org/T354604 [15:54:57] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1053/console" [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [15:56:11] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [15:57:30] (03CR) 10Eevans: [C: 03+2] restbase: partitioning and insetup for restbase10[34-42] [puppet] - 10https://gerrit.wikimedia.org/r/988728 (https://phabricator.wikimedia.org/T354227) (owner: 10Eevans) [15:58:29] !log taavi@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudrabbit1003.wikimedia.org [16:00:05] eoghan, jelto, and arnoldokoth: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T1600). [16:03:41] ^ I am on this with Brennen. Phabricator patch deployment [16:04:10] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [16:07:41] (03CR) 10Clément Goubert: [C: 03+1] k8s topology labels: add row to rack transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [16:07:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: deployment [16:07:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: deployment [16:08:01] (03CR) 10Clément Goubert: k8s topology labels: add row to rack transition [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [16:08:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2002.codfw.wmnet with reason: deployment [16:08:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: deployment [16:08:50] !log 
brennen@deploy2002 Started deploy [phabricator/deployment@369e797]: deploy to phab2002 for T354545 [16:08:54] T354545: Deploy Phabricator/Phorge 2024-01-09 - https://phabricator.wikimedia.org/T354545 [16:09:04] !log phabricator deployment in progress [16:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:08] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudrabbit1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - taavi@cumin1002" [16:09:46] !log brennen@deploy2002 Finished deploy [phabricator/deployment@369e797]: deploy to phab2002 for T354545 (duration: 00m 55s) [16:10:24] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudrabbit1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - taavi@cumin1002" [16:10:24] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:10:25] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudrabbit1003.wikimedia.org [16:10:25] !log brennen@deploy2002 Started deploy [phabricator/deployment@369e797]: deploy to phab1004 for T354545 [16:10:41] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [16:10:51] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1002 for hosts: `cloudrabbit1003.wikimedia.org` -... 
[16:11:22] !log brennen@deploy2002 Finished deploy [phabricator/deployment@369e797]: deploy to phab1004 for T354545 (duration: 00m 56s) [16:12:04] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10taavi) >>! In T345610#9409823, @taavi wrote: > This can move forward now, although due to the nature of Rabbit this needs to b... [16:12:25] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) [16:13:11] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:21:19] (03CR) 10Cathal Mooney: k8s topology labels: add row to rack transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [16:22:58] !log phabricator - differential has been disabled (T330797) [16:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:12] T330797: Uninstall Differential (Phabricator application) - https://phabricator.wikimedia.org/T330797 [16:24:55] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [16:25:21] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10taavi) a:05taavi→03None [16:26:47] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [16:26:57] !log restart prometheus@k8s on prometheus1005 revert GOGC to 100 (default) - T354604 [16:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:16] T354604: Investigate prometheus@k8s metric/label cardinality reduction - https://phabricator.wikimedia.org/T354604 [16:34:08] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [16:35:12] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.27 
ms [16:35:18] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [16:35:43] (03CR) 10Ayounsi: k8s topology labels: add row to rack transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [16:35:50] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [16:35:53] 10SRE, 10ops-codfw, 10DBA: db2143 not rebooting - https://phabricator.wikimedia.org/T354593 (10Jhancock.wm) server found rebooting. drained power. no results. opened chassis and removed all pci cards. still rebooted. removed all but the primary ram on both cpus. boots now without issue. locating bad ram a... [16:38:00] (03PS13) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [16:40:06] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [16:40:42] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:12] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [16:41:51] 1378 I am assuming is "known"; 1349 seems to have some activity today but should it be downtimed? 
[16:42:02] mw1378 and mw1349 [16:43:22] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:34] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:44:58] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:22] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 2.90 ms [16:46:23] sukhe: yeah akosiaris is working on them [16:46:39] thanks as usual claime [16:49:49] (03CR) 10Phuedx: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:50:28] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:31] yeah and I am finally I think on to something and it has the letter i s t i o for now all over it [16:51:08] not really, but it apparently is the trigger [16:51:52] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:52:16] last reboot and I am off for the day [16:53:10] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts ganeti-test[1001-1002].eqiad.wmnet [16:53:24] akosiaris: gl! [16:54:26] (03PS14) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [16:54:42] PROBLEM - Host mw1378 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:22] RECOVERY - Host mw1378 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [17:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T1700). [17:00:05] kostajh: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:22] it turns out I have an interview in this slot -- jhathaway are you around? 
[17:00:30] otherwise kostajh I'll catch up with you, sorry about that [17:00:34] yup, I'm around [17:00:51] 🙏 [17:01:58] (03CR) 10JHathaway: [C: 03+2] mediamoderation: Switch to using all.dblist [puppet] - 10https://gerrit.wikimedia.org/r/987983 (https://phabricator.wikimedia.org/T353703) (owner: 10Kosta Harlan) [17:02:11] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [17:03:53] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-dnsleaks.py: add the --to-prometheus flag [puppet] - 10https://gerrit.wikimedia.org/r/987852 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [17:04:06] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: allow passing the 'edit-managed' flag to designate [puppet] - 10https://gerrit.wikimedia.org/r/987799 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [17:04:17] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti-test[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [17:04:26] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks.py: use upstream designate clients [puppet] - 10https://gerrit.wikimedia.org/r/987800 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [17:04:33] (03CR) 10Andrew Bogott: [C: 03+2] Rename wmcs-novastats-dnsleaks to wmcs-dnsleaks [puppet] - 10https://gerrit.wikimedia.org/r/987851 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [17:04:56] (03CR) 10Tchanders: [C: 03+1] ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan) [17:05:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti-test[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - 
ayounsi@cumin1002" [17:05:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:05:29] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti-test[1001-1002].eqiad.wmnet [17:06:15] (03CR) 10Cathal Mooney: k8s topology labels: add row to rack transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [17:06:19] (03PS2) 10WMDE-Fisch: [beta] Allow Cite events for reference previews baseline stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798) [17:06:21] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts ganeti-test2004.codfw.wmnet [17:06:40] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients.py: remove a use of project_name [puppet] - 10https://gerrit.wikimedia.org/r/988050 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [17:07:31] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [17:11:18] jhathaway: thanks! [17:12:01] (03CR) 10Awight: [C: 03+1] [beta] Allow Cite events for reference previews baseline stats (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [17:12:24] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [17:14:25] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti-test2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [17:15:07] kostajh: happy to help! 
[17:17:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti-test2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [17:17:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:17:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti-test2004.codfw.wmnet [17:20:58] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-dnsleaks: Add prometheus metric [puppet] - 10https://gerrit.wikimedia.org/r/987853 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [17:21:46] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2143'] [17:22:57] 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10wiki_willy) @Papaul / @Jhancock.wm and @Jclark-ctr / @VRiley-WMF - can you see if you have any spare memory onsite for Filippo? I think i... 
[17:24:49] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (10wiki_willy) ++ @Jclark-ctr & @VRiley-WMF [17:25:46] (03PS1) 10Ayounsi: Remove mentions of ganeti-test1001/2 and 2004 [puppet] - 10https://gerrit.wikimedia.org/r/989212 (https://phabricator.wikimedia.org/T345602) [17:29:01] (03PS1) 10Btullis: Bring an-master1003 into service as a hadoop::master [puppet] - 10https://gerrit.wikimedia.org/r/989213 (https://phabricator.wikimedia.org/T332573) [17:29:03] (03PS1) 10Btullis: Bring an-master1004 into service as a hadoop::standby [puppet] - 10https://gerrit.wikimedia.org/r/989214 (https://phabricator.wikimedia.org/T332573) [17:29:29] (03PS1) 10Andrew Bogott: openstack designate: fix path to wmcs-dnsleaks for prometheus metric [puppet] - 10https://gerrit.wikimedia.org/r/989215 (https://phabricator.wikimedia.org/T354365) [17:29:49] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989215 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [17:30:54] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1055/co" [puppet] - 10https://gerrit.wikimedia.org/r/989213 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [17:31:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2143'] [17:33:02] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2143'] [17:33:07] (03PS2) 10Btullis: Bring an-master1003 into service as a hadoop::master [puppet] - 10https://gerrit.wikimedia.org/r/989213 (https://phabricator.wikimedia.org/T332573) [17:33:09] (03PS2) 10Btullis: Bring an-master1004 into service as a hadoop::standby [puppet] - 10https://gerrit.wikimedia.org/r/989214 (https://phabricator.wikimedia.org/T332573) [17:34:08] 
(03CR) 10Andrew Bogott: [C: 03+2] openstack designate: fix path to wmcs-dnsleaks for prometheus metric [puppet] - 10https://gerrit.wikimedia.org/r/989215 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [17:36:59] (03PS1) 10Cyndywikime: Add account_conversion event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 [17:40:38] 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10Jhancock.wm) at codfw we have two 32GB 2Rx4 PC4 2666V and one 32GB 2Rx4 PC4 2400V [17:41:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2143'] [17:41:53] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/977223 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [17:42:16] 10ops-codfw, 10ops-eqiad, 10Infrastructure-Foundations, 10Patch-For-Review: Repurpose three decom servers as temporary ganeti-test1001/1002 and ganeti-test2004 - https://phabricator.wikimedia.org/T345602 (10ayounsi) a:05ayounsi→03None Those 3 servers have been decommissioned. Over to DCops to finish th... [17:45:29] 10ops-eqiad, 10decommission-hardware: decommission ganeti-test1001, ganeti-test1002 - https://phabricator.wikimedia.org/T354680 (10RobH) [17:45:38] 10ops-eqiad, 10decommission-hardware: decommission ganeti-test1001, ganeti-test1002 - https://phabricator.wikimedia.org/T354680 (10RobH) [17:45:44] 10ops-codfw, 10ops-eqiad, 10Infrastructure-Foundations, 10Patch-For-Review: Repurpose three decom servers as temporary ganeti-test1001/1002 and ganeti-test2004 - https://phabricator.wikimedia.org/T345602 (10RobH) [17:46:04] 10SRE, 10ops-codfw, 10DBA: db2143 not rebooting - https://phabricator.wikimedia.org/T354593 (10Jhancock.wm) a:03Jhancock.wm B5 has been replaced. the bad DIMM has been labeled and separated from the stock. 
idrac and bios firmware were updated. server is up and pingable. let us know if any more issues po... [17:46:09] 10ops-codfw, 10decommission-hardware: decommission ganeti-test2004 - https://phabricator.wikimedia.org/T354681 (10RobH) [17:46:20] 10ops-codfw, 10decommission-hardware: decommission ganeti-test2004 - https://phabricator.wikimedia.org/T354681 (10RobH) [17:46:26] 10ops-codfw, 10ops-eqiad, 10Infrastructure-Foundations, 10Patch-For-Review: Repurpose three decom servers as temporary ganeti-test1001/1002 and ganeti-test2004 - https://phabricator.wikimedia.org/T345602 (10RobH) [17:47:37] (03PS1) 10Xcollazo: Add data1.usrdm1.scatter.red to rsync config for dumps [puppet] - 10https://gerrit.wikimedia.org/r/989217 (https://phabricator.wikimedia.org/T354679) [17:47:42] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T354580 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact. will resolve when migrated to spine/leaf [17:48:57] 10ops-codfw, 10ops-eqiad, 10Infrastructure-Foundations, 10Patch-For-Review: Repurpose three decom servers as temporary ganeti-test1001/1002 and ganeti-test2004 - https://phabricator.wikimedia.org/T345602 (10RobH) 05Open→03Resolved a:03RobH T354680 and T354681 filed for decom [17:49:00] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10RobH) [17:49:41] 10ops-codfw, 10decommission-hardware: decommission ganeti-test2004 - https://phabricator.wikimedia.org/T354681 (10ayounsi) [17:49:55] 10ops-eqiad, 10decommission-hardware: decommission ganeti-test1001, ganeti-test1002 - https://phabricator.wikimedia.org/T354680 (10ayounsi) [17:50:01] 10SRE, 10ops-eqiad, 10Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (10fgiunchedi) [17:50:44] (03PS2) 10Xcollazo: Add data1.usrdm1.scatter.red to rsync config for dumps [puppet] - 
10https://gerrit.wikimedia.org/r/989217 (https://phabricator.wikimedia.org/T354679) [17:51:17] 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: RAM upgrade for prometheus200[56] - https://phabricator.wikimedia.org/T354685 (10fgiunchedi) [17:51:59] 10SRE, 10ops-codfw, 10Observability-Metrics: RAM upgrade for prometheus200[56] - https://phabricator.wikimedia.org/T354685 (10fgiunchedi) [17:52:02] (03CR) 10Xcollazo: "The intention is to merge this patch on Jan 19, 2024, at which an older mirror becomes inactive, and this newer mirror becomes active." [puppet] - 10https://gerrit.wikimedia.org/r/989217 (https://phabricator.wikimedia.org/T354679) (owner: 10Xcollazo) [17:52:26] (03PS1) 10Andrew Bogott: cloudservices: include openstack envscripts [puppet] - 10https://gerrit.wikimedia.org/r/989220 (https://phabricator.wikimedia.org/T354365) [17:52:42] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989220 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [17:53:09] 10SRE, 10ops-eqiad, 10Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (10fgiunchedi) [17:57:13] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices: include openstack envscripts [puppet] - 10https://gerrit.wikimedia.org/r/989220 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T1800) [18:00:58] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:01:07] 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10Jhancock.wm) we also have eight 16GB 2Rv4 PC3L but I am not sure if they are compatible [18:02:26] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 44 probes of 797 (alerts on 35) - 
https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:02:46] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new reverse entries for mr1 -> lsw1-a2 link in codfw - cmooney@cumin1002" [18:04:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new reverse entries for mr1 -> lsw1-a2 link in codfw - cmooney@cumin1002" [18:04:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:07:18] (03PS1) 10Cathal Mooney: Add new include statement in IPv6 reverse zone for mr1-codfw lsw link [dns] - 10https://gerrit.wikimedia.org/r/989221 (https://phabricator.wikimedia.org/T348164) [18:07:58] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 9 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:14:22] (03PS1) 10Btullis: Add dummy keytabs for new hadoop master servers [labs/private] - 10https://gerrit.wikimedia.org/r/989222 (https://phabricator.wikimedia.org/T332573) [18:14:51] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy keytabs for new hadoop master servers [labs/private] - 10https://gerrit.wikimedia.org/r/989222 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [18:16:22] (03CR) 10Ssingh: [C: 03+1] Add new include statement in IPv6 reverse zone for mr1-codfw lsw link [dns] - 10https://gerrit.wikimedia.org/r/989221 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [18:16:43] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/988107/1057/" [puppet] - 10https://gerrit.wikimedia.org/r/988107 
(https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [18:21:13] (03CR) 10Cathal Mooney: [C: 03+2] Add new include statement in IPv6 reverse zone for mr1-codfw lsw link [dns] - 10https://gerrit.wikimedia.org/r/989221 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [18:22:18] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10cmooney) @papaul hey the link from mr1-codfw ge-0/0/3 to lsw1-a2-codfw ge-0/0/47 is now configured, but it's down both sides.... [18:22:26] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 26 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/989213 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [18:23:29] jouncebot: now [18:23:29] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T1800) [18:27:09] running scap stage train to deploy to testwikis [18:28:00] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989223 (https://phabricator.wikimedia.org/T350089) [18:28:03] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989223 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [18:28:35] (03PS1) 10Cathal Mooney: Add BGP session between mr1-codfw and lsw1-a2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/989224 (https://phabricator.wikimedia.org/T348164) [18:28:43] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989223 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [18:29:05] !log jhuneidi@deploy2002 Started scap: testwikis wikis 
to 1.42.0-wmf.13 refs T350089 [18:29:09] T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089 [18:33:55] (03CR) 10Cathal Mooney: [C: 03+2] Add other vairant of QFX5120 to L3_SWITCHES_MODELS [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/989136 (https://phabricator.wikimedia.org/T306649) (owner: 10Cathal Mooney) [18:34:51] 10SRE, 10ops-codfw, 10Observability-Metrics: RAM upgrade for prometheus200[56] - https://phabricator.wikimedia.org/T354685 (10wiki_willy) a:03Jhancock.wm [18:36:38] 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10wiki_willy) Awesome, thanks @Jhancock.wm. Here's the codfw upgrade ticket for you to coordinate with @fgiunchedi on the downtime - T35468... [18:38:28] 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, 10Data-Platform-SRE (2023/24 Q3 Milestone 1), and 3 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (10bking) 05Open→03Resolved a:03bking [18:40:10] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1001-1002].eqiad.wmnet with reason: Release v0.6.5 - cmooney@cumin1002 [18:42:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1001-1002].eqiad.wmnet with reason: Release v0.6.5 - cmooney@cumin1002 [18:53:32] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/988107 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [18:57:33] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) > I was surprised we don't have the same need for public IPs there. We will at some point, the thought process was that I assigned both public/private for codf... 
[19:00:05] jeena and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T1900). [19:00:16] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:04:32] (03CR) 10Dzahn: "Do you have a link for reference where you found this?" [puppet] - 10https://gerrit.wikimedia.org/r/988679 (https://phabricator.wikimedia.org/T354484) (owner: 10AOkoth) [19:05:03] (03PS1) 10Andrew Bogott: cloudservices: include observer env [puppet] - 10https://gerrit.wikimedia.org/r/989228 (https://phabricator.wikimedia.org/T354365) [19:05:17] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989228 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [19:05:52] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:06:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:08:00] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service,httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:16] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices: include observer env [puppet] - 
10https://gerrit.wikimedia.org/r/989228 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [19:11:22] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:13:18] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:14:54] !log jhuneidi@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.13 refs T350089 (duration: 45m 48s) [19:15:03] T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089 [19:16:27] Will be rolling back train due to a blocker [19:16:47] !log decommissioning cassandra, restbase2013-{a,b,c} — T352469 [19:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:52] T352469: Decommission restbase20[13-20]) - https://phabricator.wikimedia.org/T352469 [19:17:15] (03PS1) 10TrainBranchBot: all wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989231 (https://phabricator.wikimedia.org/T350089) [19:17:17] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989231 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [19:18:09] (03Merged) 10jenkins-bot: all wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989231 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [19:25:59] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.42.0-wmf.12 refs T350089 [19:26:10] T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089 [19:29:30] PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - 
free space: /srv 613 MB (1% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [19:30:08] oh, that is no good [19:30:14] if the mwmaint runs out of disk [19:30:47] existing issue for a long time that old MW versions are not auto-removed or so [19:31:11] php versions [19:31:40] also look at that jump in size: [19:31:41] 2.4G php-1.40.0-wmf.17 [19:31:41] 6.4G php-1.42.0-wmf.10 [19:31:50] now they are always over 6GB [19:32:16] jeena: do you know if we can delete an old one ^ ? [19:32:31] I'd presume any not on the deploy host are fair game [19:32:37] php-1.40.0-wmf.17 is pretty out of date by now. the wmf.10 is newer [19:32:48] And therefore anything < 1.42 is definitely good to go [19:32:54] should I nuke php-1.39.0-wmf.25 to start with [19:32:59] * Reedy grins [19:32:59] alright [19:33:11] mutante: I think it's okay to delete [19:33:17] The 1.39 one will be over a year old... [19:33:26] wat [19:33:37] !log mwmaint1002 - rm -rf /srv/mediawiki/php-1.39.0-wmf.25 after monitoring alerted about 99% disk usage on /srv [19:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:40] yeah, fine to destroy that one [19:34:01] down to 93% [19:34:09] will alert again at 95 I think [19:34:38] what's interesting is.. now each version uses more than 6GB [19:34:45] before they were much smaller at 2.5 [19:35:01] There's some larger vendor libraries... 
which is a few hundred meg [19:35:34] !log mwmaint1002 - rm -rf /srv/mediawiki/php-1.40.0-wmf.17 [19:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:37] mutante: I think that's not an accurate comparison [19:35:46] It was only the cache dir left in that 1.40.0 version [19:35:51] ah, ok [19:35:52] 2.4G php-1.40.0-wmf.17/cache/l10n [19:36:18] 5.4G php-1.42.0-wmf.13/cache/ [19:36:21] though, that is some growth [19:36:34] /inconsistent cleanup [19:36:37] down to 87% but that is still not enough for even the next version [19:37:13] the two versions we need to keep are 1.42.0-wmf.12 and 1.42.0-wmf.13 [19:37:15] we can only have 5 versions at a time [19:37:25] with the current size of /srv [19:37:46] but they need pruning from the deploy host first ofc [19:38:01] deletes 42.0-wmf.7 [19:38:35] there, 12G free again, good for now [19:38:48] it's weird. the deploy host *doesn't* have all these versions. It has 1.42.0-wmf.{7,9,10,12,13} [19:38:49] but this was maintenance server, not deployment [19:39:17] I recall this issue on deployment servers in the past [19:39:22] maybe we fixed it there but never on mwmaint [19:39:24] some messed up permissions, so they didn't get removed properly? [19:40:11] do we want to scap clean .7 and .9 [19:40:26] mwmaint2002 also has 1.39 [19:40:38] heh [19:40:46] but for some reason it has 100GB free regardless [19:41:19] the problem, circa 2017 is rsync --delete doesn't delete files it doesn't own (so l10nupdate files), unsure if the problem is still the same or not. 
[19:41:21] (03PS1) 10Bking: wdqs: Enable TLS for test tier [puppet] - 10https://gerrit.wikimedia.org/r/989236 (https://phabricator.wikimedia.org/T354555) [19:41:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989236 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [19:42:33] https://phabricator.wikimedia.org/T130317 [19:43:07] https://phabricator.wikimedia.org/T119747 [19:43:21] "Clean up l10nupdate cache junk when scap clean is run" [19:43:32] does anything run "scap clean" [19:43:46] I think it's done as part of the train conductor duties (ish) [19:44:56] !log mwmaint1002 - rm -rf 1.42.0-wmf.7 ; mwmaint2002 - rm -rf php-1.39.0-wmf.25 [19:44:56] (nothing in the SAL?) [19:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:14] I think scap clean is automated now, but may need some debugging (cc dancy) [19:45:27] fancy [19:45:30] back in my day... [19:45:44] ^ that's basically all the information I have: back in my day :D [19:45:59] TBH, if it's a couple of random versions from .39 and .40, we shouldn't be too worried [19:46:06] if most of .41 was still there, I would be [19:46:20] (03CR) 10JHathaway: [C: 03+1] "looks good, any roles to double check with PCC?" [puppet] - 10https://gerrit.wikimedia.org/r/989086 (https://phabricator.wikimedia.org/T296533) (owner: 10Muehlenhoff) [19:46:32] It's possibly just some cleanup that was missed when improvements for removing newer stuff came in [19:46:45] back to 2014 https://phabricator.wikimedia.org/T73313 [19:47:21] "..close this resolved. This probably shouldn't be fully automated " [19:47:24] yeah, "scap clean" runs at the last stage of "scap stage-train" (You wrote that code thcipriani) which runs weekly. 
[19:47:40] heh [19:50:02] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [19:50:08] so maybe then it is...fixed for deployment* but same stuff should also apply to mwmaint [19:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [19:50:38] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/989103 (owner: 10Muehlenhoff) [19:51:43] (03CR) 10Andrew Bogott: [C: 03+2] team-wmcs: alert when stray DNS records appear in designate. [alerts] - 10https://gerrit.wikimedia.org/r/987858 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [19:57:06] (03PS1) 10Ryan Kemper: s/unssuported/unsupported [puppet] - 10https://gerrit.wikimedia.org/r/989238 [20:02:03] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989239 (https://phabricator.wikimedia.org/T350089) [20:02:05] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989239 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [20:02:50] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989239 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [20:03:13] !log jhuneidi@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.13 refs T350089 [20:03:23] T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089 [20:04:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd 
unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:04:14] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:32] (03PS1) 10Dzahn: phabricator: avoid duplicate list of servers in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) [20:06:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:09:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:09:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:12:53] (03CR) 10Ryan Kemper: [C: 03+2] s/unssuported/unsupported [puppet] - 10https://gerrit.wikimedia.org/r/989238 (owner: 10Ryan Kemper) [20:14:40] huh, if it auto runs, then I'm at a loss for why there are still old versions lingering (unless we exclude mwmaint from the targets on that one) [20:18:10] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:18:18] (03PS1) 10Bking: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 
(https://phabricator.wikimedia.org/T354555) [20:19:24] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [20:19:27] (03CR) 10CI reject: [V: 04-1] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [20:23:06] (03PS2) 10Dzahn: phabricator: avoid duplicate list of servers in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) [20:23:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:23:22] (03CR) 10CI reject: [V: 04-1] phabricator: avoid duplicate list of servers in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [20:23:55] (03PS3) 10Dzahn: phabricator: avoid duplicate list of servers in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) [20:25:09] (03CR) 10CI reject: [V: 04-1] phabricator: avoid duplicate list of servers in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [20:26:00] (03PS4) 10Dzahn: phabricator: avoid duplicate list of servers in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) [20:26:46] !log jhuneidi@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.13 refs T350089 (duration: 23m 33s) [20:26:53] T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089 [20:28:09] (MediaWikiLatencyExceeded) firing: Average 
latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:30:23] (03CR) 10AOkoth: vrts: enable connection pooling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988679 (https://phabricator.wikimedia.org/T354484) (owner: 10AOkoth) [20:32:44] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989245 (https://phabricator.wikimedia.org/T350089) [20:32:48] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989245 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [20:33:31] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989245 (https://phabricator.wikimedia.org/T350089) (owner: 10TrainBranchBot) [20:38:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:40:49] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.13 refs T350089 [20:40:53] T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089 [20:43:08] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "compiler show the only diff should be the parameter itself:" [puppet] - 10https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [20:46:01] (03CR) 10Hashar: "I think 
Volans already migrated some repositories to tox v4 and thus might be familiar with the differences compared to tox v3 ;)" [puppet] - 10https://gerrit.wikimedia.org/r/977223 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [20:46:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:48:41] !log about to deploy analytics/refinery - weekly train [20:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:12] 10SRE, 10Infrastructure-Foundations, 10Traffic: NetworkProbeLimit cookie should set samesite attribute - https://phabricator.wikimedia.org/T342624 (10Tgr) [20:49:24] !log aqu@deploy2002 Started deploy [analytics/refinery@c4fed56]: Regular analytics weekly train [analytics/refinery@c4fed56c] [20:49:40] !log eevans@cumin1002 conftool action : set/weight=0; selector: cluster=restbase,dc=codfw,name=restbase2013.codfw.wmnet [20:49:42] (SystemdUnitFailed) firing: wdqs-updater.service Failed on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:49] !log eevans@cumin1002 conftool action : set/weight=0; selector: cluster=restbase,dc=codfw,name=restbase2014.codfw.wmnet [20:49:57] !log eevans@cumin1002 conftool action : set/weight=0; selector: cluster=restbase,dc=codfw,name=restbase2019.codfw.wmnet [20:51:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:53:27] (SystemdUnitFailed) resolved: (3) wdqs-updater.service Failed on wdqs2019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:58:30] !log aqu@deploy2002 Finished deploy [analytics/refinery@c4fed56]: Regular analytics weekly train [analytics/refinery@c4fed56c] (duration: 09m 06s) [20:58:55] !log aqu@deploy2002 Started deploy [analytics/refinery@c4fed56] (thin): Regular analytics weekly train THIN [analytics/refinery@c4fed56c] [20:59:01] !log aqu@deploy2002 Finished deploy [analytics/refinery@c4fed56] (thin): Regular analytics weekly train THIN [analytics/refinery@c4fed56c] (duration: 00m 06s) [20:59:18] !log aqu@deploy2002 Started deploy [analytics/refinery@c4fed56] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c4fed56c] [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240109T2100) [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:00:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:02:18] (PS2) Dzahn: phabricator: avoid duplicate list of server names in Hiera [puppet] - https://gerrit.wikimedia.org/r/988112 (https://phabricator.wikimedia.org/T354221) [21:02:42] (PS3) Dzahn: phabricator: avoid duplicate list of servers for dumps [puppet] - https://gerrit.wikimedia.org/r/988112 (https://phabricator.wikimedia.org/T354221) [21:02:50] (PS1) Eevans: restbase: configure new hosts for partition reuse [puppet] - https://gerrit.wikimedia.org/r/989248 (https://phabricator.wikimedia.org/T352468) [21:02:52] !log aqu@deploy2002 Finished deploy [analytics/refinery@c4fed56] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c4fed56c] (duration: 03m 33s) [21:02:54] (PS4) Dzahn: phabricator: avoid duplicate list of servers for dumps [puppet] - https://gerrit.wikimedia.org/r/988112 (https://phabricator.wikimedia.org/T354221) [21:03:42] (SystemdUnitFailed) firing: wdqs-updater.service Failed on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:03:44] IRC bot did not update the topic again [21:03:48] !log aqu@deploy2002 Started deploy [analytics/refinery@c4fed56] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c4fed56c] (test number 2 after permission error) [21:03:54] !log aqu@deploy2002 Finished deploy [analytics/refinery@c4fed56] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c4fed56c] (test number 2 after
permission error) (duration: 00m 05s) [21:04:45] SRE-tools, Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (BCornwall) Stalled→Declined Will do. @BBlack, if you still feel strongly about this please reopen :) [21:05:10] (PS2) Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics Chrome private prefetch [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [21:05:27] (SystemdUnitFailed) resolved: (3) wdqs-updater.service Failed on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:23] (PS3) Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics Chrome private prefetch [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [21:07:57] (SystemdUnitFailed) firing: wdqs-updater.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:42] (SystemdUnitFailed) resolved: (3) wdqs-updater.service Failed on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:12] (SystemdUnitFailed) firing: (3) wdqs-updater.service Failed on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:11:20] ryankemper, inflatador : do we have an issue with WDQS ?
[21:12:57] (SystemdUnitFailed) resolved: (3) wdqs-updater.service Failed on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:12:58] (CR) Dzahn: "I see. Maybe you can ask them in a follow-up question how this works. Like do they mean you should manually run this script once?" [puppet] - https://gerrit.wikimedia.org/r/988679 (https://phabricator.wikimedia.org/T354484) (owner: AOkoth) [21:13:27] (SystemdUnitFailed) firing: (2) wdqs-updater.service Failed on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:12] (SystemdUnitFailed) resolved: (4) wdqs-updater.service Failed on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:27] (CR) Dzahn: [V: +1 C: +1] "compiler shows only the order of host names changes:" [puppet] - https://gerrit.wikimedia.org/r/988112 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn) [21:21:01] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@ea53374]: Regular airflow-dags/analytics_test weekly train [airflow-dags@ea53374f] [21:21:13] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@ea53374]: Regular airflow-dags/analytics_test weekly train [airflow-dags@ea53374f] (duration: 00m 12s) [21:22:35] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@ea53374]: Regular airflow-dags/analytics weekly train [airflow-dags@ea53374f] [21:23:03] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@ea53374]: Regular airflow-dags/analytics weekly train [airflow-dags@ea53374f] (duration: 00m 28s) [21:27:12]
(PS1) Andrew Bogott: wmcs-dnsleaks: fix prometheus comments and metric name [puppet] - https://gerrit.wikimedia.org/r/989249 (https://phabricator.wikimedia.org/T354365) [21:28:26] (CR) CI reject: [V: -1] wmcs-dnsleaks: fix prometheus comments and metric name [puppet] - https://gerrit.wikimedia.org/r/989249 (https://phabricator.wikimedia.org/T354365) (owner: Andrew Bogott) [21:29:34] (PS2) Andrew Bogott: wmcs-dnsleaks: fix prometheus comments and metric name [puppet] - https://gerrit.wikimedia.org/r/989249 (https://phabricator.wikimedia.org/T354365) [21:30:22] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (thcipriani) > To be able to access deployed Wiki instances and debug issues that are only reproducible in production Can you say more about this? The `restrict... [21:32:47] (CR) Andrew Bogott: [C: +2] wmcs-dnsleaks: fix prometheus comments and metric name [puppet] - https://gerrit.wikimedia.org/r/989249 (https://phabricator.wikimedia.org/T354365) (owner: Andrew Bogott) [21:33:45] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [21:35:10] (PS4) Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics Chrome private prefetch [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [21:41:15] (PS5) Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics Chrome private prefetch [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [21:45:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int -
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:45:26] (Abandoned) Bking: wdqs: Enable TLS for test tier [puppet] - https://gerrit.wikimedia.org/r/989236 (https://phabricator.wikimedia.org/T354555) (owner: Bking) [21:47:59] (CR) Dr0ptp4kt: "Looking for code reviews and Traffic-guided deployment." [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [21:49:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:59:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:00:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:15:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:29:11] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:49:54] (CR) Tim Starling: "In PS2 of I111c01ae7e07ac7a943a32192c867ce9754b690a I made it so that attempting scaling with this config will return a MediaTransformErro" [mediawiki-config] - https://gerrit.wikimedia.org/r/987432 (https://phabricator.wikimedia.org/T352515) (owner: Giuseppe Lavagetto) [22:54:14] PROBLEM - PHP opcache health on mw2426 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:54:46] PROBLEM - PHP opcache health on mw2279 is CRITICAL: CRITICAL: opcache full on php 7.4.
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:57:10] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:00:59] (PS1) Andrew Bogott: designate alert: Add dashboard link [alerts] - https://gerrit.wikimedia.org/r/989255 (https://phabricator.wikimedia.org/T354365) [23:03:03] (CR) Andrew Bogott: [C: +2] designate alert: Add dashboard link [alerts] - https://gerrit.wikimedia.org/r/989255 (https://phabricator.wikimedia.org/T354365) (owner: Andrew Bogott) [23:07:10] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:11:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:21:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:23:04] PROBLEM - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:28:47] (CR) Cwhite: [C: +1] "Looks like a noop for logstash - thanks!" [puppet] - https://gerrit.wikimedia.org/r/989086 (https://phabricator.wikimedia.org/T296533) (owner: Muehlenhoff) [23:34:52] PROBLEM - PHP opcache health on mw2444 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:35:12] PROBLEM - PHP opcache health on mw2278 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:37:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:42:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:45:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:50:18] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk