[00:03:55] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:55] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [00:13:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P45824 and previous config saved to /var/cache/conftool/dbconfig/20230314-001313-marostegui.json [00:17:26] (03CR) 10Andrew Bogott: [C: 03+2] rbd2backy2.py: handle empty 'expire' stamps [puppet] - 10https://gerrit.wikimedia.org/r/897955 (owner: 10Andrew Bogott) [00:21:25] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [00:28:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T329260)', diff saved to https://phabricator.wikimedia.org/P45825 and previous config saved to /var/cache/conftool/dbconfig/20230314-002819-marostegui.json [00:28:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:28:25] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [00:28:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:28:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T329260)', diff saved to https://phabricator.wikimedia.org/P45826 and previous config saved to 
/var/cache/conftool/dbconfig/20230314-002840-marostegui.json [00:39:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T329260)', diff saved to https://phabricator.wikimedia.org/P45827 and previous config saved to /var/cache/conftool/dbconfig/20230314-003903-marostegui.json [00:39:09] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [00:48:23] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:48:25] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:48:33] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:54:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P45828 and previous config saved to /var/cache/conftool/dbconfig/20230314-005409-marostegui.json [00:57:25] RECOVERY - Check systemd state on aphlict2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10phaultfinder) [01:09:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P45829 and previous config saved to /var/cache/conftool/dbconfig/20230314-010915-marostegui.json [01:14:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:19:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:24:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T329260)', diff saved to https://phabricator.wikimedia.org/P45830 and previous config saved to /var/cache/conftool/dbconfig/20230314-012421-marostegui.json [01:24:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [01:24:28] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [01:24:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [01:24:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T329260)', diff saved to https://phabricator.wikimedia.org/P45831 and previous config saved to /var/cache/conftool/dbconfig/20230314-012442-marostegui.json [01:29:49] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10Papaul) @wiki_willy thank you for the heads up. 
@MatthewVernon i checked the systemc [01:35:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T329260)', diff saved to https://phabricator.wikimedia.org/P45832 and previous config saved to /var/cache/conftool/dbconfig/20230314-013504-marostegui.json [01:35:12] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [01:50:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P45833 and previous config saved to /var/cache/conftool/dbconfig/20230314-015011-marostegui.json [01:59:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0200) [02:04:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:05:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P45834 and previous config saved to /var/cache/conftool/dbconfig/20230314-020517-marostegui.json [02:07:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.27 [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/897929 (https://phabricator.wikimedia.org/T330205) [02:07:59] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.27 [core] 
(wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/897929 (https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot) [02:09:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T329260)', diff saved to https://phabricator.wikimedia.org/P45835 and previous config saved to /var/cache/conftool/dbconfig/20230314-022023-marostegui.json [02:20:30] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [02:22:21] !log removed user's 2FA on wikitech for T331955 [02:22:22] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10Papaul) Your dispatch shipped on 3/13/2023 4:41 PM [02:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:16] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.27 [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/897929 (https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot) [02:24:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:31:43] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 
10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Samwilson) [02:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:44:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:49:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0300) [03:00:23] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:14] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898245 (https://phabricator.wikimedia.org/T330205) [03:01:16] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898245 (https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot) [03:02:02] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898245 
(https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot) [03:02:27] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.40.0-wmf.27 refs T330205 [03:02:34] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [03:07:07] (03PS1) 10Gergő Tisza: [beta] GrowthExperiments: Short leveling up notification delay for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898246 (https://phabricator.wikimedia.org/T330358) [03:18:13] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:29:05] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:53:30] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.40.0-wmf.27 refs T330205 (duration: 51m 02s) [03:53:35] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [03:55:52] !log mwpresync@deploy2002 Pruned MediaWiki: 1.40.0-wmf.25 (duration: 02m 20s) [04:56:39] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.122`. 
Pre-deploy tests passing on canary `wdqs1003` [04:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:53] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@61ef435]: 0.3.122 [04:57:23] !log [WDQS Deploy] Tests passing following deploy of `0.3.122` on canary `wdqs1003`; proceeding to rest of fleet [04:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:39] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@61ef435]: 0.3.122 (duration: 08m 45s) [05:07:07] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [05:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:11] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [05:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:22] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [05:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0600) [06:00:05] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0600). 
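Note on the [05:07:07] to [05:07:22] WDQS restart entries: the batching they describe (4 hosts at a time for `wdqs-updater`, one node at a time with depool, 45 s wait, restart, 45 s wait, pool for `wdqs-categories`) can be sketched roughly as below. This is only an illustration and not how cumin itself works: the helper names (`batches`, `restart_command`, `rolling_restart`), the placeholder host names, and the `run` callback are all hypothetical, and the sketch visits hosts in a batch one after another and executes nothing unless a `run` callback is supplied.

```python
from typing import Callable, Iterator, List, Optional, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def restart_command(unit: str, settle_seconds: int = 45) -> str:
    """Build the depool / wait / restart / wait / pool chain seen in the log."""
    return (
        f"depool && sleep {settle_seconds} && systemctl restart {unit} "
        f"&& sleep {settle_seconds} && pool"
    )


def rolling_restart(
    hosts: Sequence[str],
    unit: str,
    batch_size: int = 1,
    run: Optional[Callable[[str, str], None]] = None,
) -> List[List[str]]:
    """Walk the hosts batch by batch and return the batches in order.

    `run(host, command)` is a hypothetical hook; with run=None this is a
    pure dry run that only computes the batch plan.
    """
    command = restart_command(unit)
    completed: List[List[str]] = []
    for group in batches(hosts, batch_size):
        for host in group:
            if run is not None:
                run(host, command)
        completed.append(group)
    return completed


if __name__ == "__main__":
    # Placeholder host names, not real fleet members.
    rolling_restart(
        ["host-a", "host-b", "host-c"],
        "wdqs-categories",
        batch_size=1,
        run=lambda host, cmd: print(f"[dry-run] {host}: {cmd}"),
    )
```

With `batch_size=1` only one host is out of the pool at any moment, which is the point of the one-node-at-a-time variant for the lvs-managed hosts.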
[06:04:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:04:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:16:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [06:16:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [06:16:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T329260)', diff saved to https://phabricator.wikimedia.org/P45836 and previous config saved to /var/cache/conftool/dbconfig/20230314-061633-marostegui.json [06:16:39] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [06:19:41] (03PS8) 10KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [06:24:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:28:37] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts centrallog1001 [06:32:00] (03PS1) 10Marostegui: production-m2.sql.erb: New user for excimer database [puppet] - 10https://gerrit.wikimedia.org/r/898447 (https://phabricator.wikimedia.org/T331956) [06:33:09] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:34:10] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [06:34:29] (03CR) 10Marostegui: "This requires manual creation on the database. 
This patch is just for tracking purposes" [puppet] - 10https://gerrit.wikimedia.org/r/898447 (https://phabricator.wikimedia.org/T331956) (owner: 10Marostegui) [06:34:43] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:34:53] (03CR) 10Andrea Denisse: [C: 03+1] logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [06:35:27] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [software/librenms] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) (owner: 10Filippo Giunchedi) [06:36:11] (03PS2) 10Marostegui: production-m2.sql.erb: New user for excimer database [puppet] - 10https://gerrit.wikimedia.org/r/898447 (https://phabricator.wikimedia.org/T331956) [06:36:18] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: centrallog1001 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001" [06:41:46] !log gerrit: changed `operations/puppet` merge strategy to allow "content merges" (see `ops` list for the rationale) [06:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:46] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: centrallog1001 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001" [06:42:46] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:42:47] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts centrallog1001 [06:43:05] (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: New user for excimer database [puppet] - 10https://gerrit.wikimedia.org/r/898447 
(https://phabricator.wikimedia.org/T331956) (owner: 10Marostegui) [06:46:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T329260)', diff saved to https://phabricator.wikimedia.org/P45837 and previous config saved to /var/cache/conftool/dbconfig/20230314-064630-marostegui.json [06:46:36] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:00:04] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:01:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45838 and previous config saved to /var/cache/conftool/dbconfig/20230314-070137-marostegui.json [07:10:30] (03PS1) 10Marostegui: mariadb: Migrate db2135 to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/898456 (https://phabricator.wikimedia.org/T322294) [07:11:20] (03CR) 10Marostegui: [C: 03+2] mariadb: Migrate db2135 to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/898456 (https://phabricator.wikimedia.org/T322294) (owner: 10Marostegui) [07:13:53] !log Migrate db2135 to mariadb m5 codfw dbmaint 10.6 [07:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45839 and previous config saved to /var/cache/conftool/dbconfig/20230314-071643-marostegui.json [07:25:59] !log Migrate db1183 to mariadb m5 eqiad dbmaint 10.6 T322294 [07:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:04] T322294: Migrate m5 section to MariaDB 10.6 - https://phabricator.wikimedia.org/T322294 [07:26:38] (03PS1) 
10Marostegui: mariadb: Migrate db1183 to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/898672 (https://phabricator.wikimedia.org/T322294) [07:27:23] (03CR) 10Marostegui: [C: 03+2] mariadb: Migrate db1183 to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/898672 (https://phabricator.wikimedia.org/T322294) (owner: 10Marostegui) [07:31:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T329260)', diff saved to https://phabricator.wikimedia.org/P45840 and previous config saved to /var/cache/conftool/dbconfig/20230314-073149-marostegui.json [07:31:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [07:31:55] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:32:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [07:32:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T329260)', diff saved to https://phabricator.wikimedia.org/P45841 and previous config saved to /var/cache/conftool/dbconfig/20230314-073210-marostegui.json [07:57:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T329260)', diff saved to https://phabricator.wikimedia.org/P45842 and previous config saved to /var/cache/conftool/dbconfig/20230314-075730-marostegui.json [07:57:37] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:00:33] (03CR) 10Muehlenhoff: [C: 03+2] Configure database size for MDB backend [puppet] - 10https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [08:04:50] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10MoritzMuehlenhoff) @karapayneWMDE : This needs your sign off on the 
WMDE side. @thcipriani : This needs your approval for the deployment access [08:05:08] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10MoritzMuehlenhoff) [08:08:23] (03PS1) 10Muehlenhoff: Add itamar to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/898675 (https://phabricator.wikimedia.org/T331899) [08:12:09] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10MoritzMuehlenhoff) >>! In T331647#8688948, @xcollazo wrote: > Hal needs to deploy to the `platform-eng` Airflow instance. So he needs `platform-eng-deployers`? That's the correct group, yes (although per t... [08:12:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45843 and previous config saved to /var/cache/conftool/dbconfig/20230314-081236-marostegui.json [08:20:42] (03CR) 10Cathal Mooney: [C: 03+2] Restrict prefix length for public announce, allow bgp for cloud range (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/896200 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [08:27:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45845 and previous config saved to /var/cache/conftool/dbconfig/20230314-082743-marostegui.json [08:31:45] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:50] !log fetch haproxy 2.6.10 for thirdparty/haproxy26 (buster && bullseye) @ apt.wm.o [08:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:04] (03PS1) 10Cathal Mooney: Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/898678 
(https://phabricator.wikimedia.org/T327919) [08:32:51] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fgiunchedi) [08:34:45] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:53] (03CR) 10Ayounsi: [C: 03+1] "1 small comment then lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [08:36:17] (03CR) 10Ayounsi: [C: 04-1] "Almost forgot, it needs to go with a change in config/sites.yaml as well." [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [08:38:48] !log test HAProxy 2.6.10 in cp4044 and cp4045 [08:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T329260)', diff saved to https://phabricator.wikimedia.org/P45846 and previous config saved to /var/cache/conftool/dbconfig/20230314-084249-marostegui.json [08:42:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [08:42:56] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:42:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:43:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 
on db2139.codfw.wmnet with reason: Maintenance [08:44:45] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:47:46] 10SRE, 10Infrastructure-Foundations, 10netops: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10ayounsi) Indeed! looks like Cisco specific :( I sent an email t our account rep just in case: > Additionally I was wondering if Junos supported in any way forwardingStatus in IP... [08:47:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:49:45] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:49:58] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Vgutierrez) [08:50:13] (03PS1) 10Filippo Giunchedi: karma: change 'source' label color [puppet] - 10https://gerrit.wikimedia.org/r/898682 [08:52:37] (03CR) 10Filippo Giunchedi: [C: 03+2] "Trivial change thus self-merge" [puppet] - 10https://gerrit.wikimedia.org/r/898682 (owner: 10Filippo Giunchedi) [08:53:00] (03PS3) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [09:01:14] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Actually run 
1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/897950 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [09:02:26] (03PS4) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [09:03:37] (03PS5) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [09:04:47] (03CR) 10Stevemunene: [C: 03+1] Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [09:04:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:06:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [09:06:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [09:06:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T329260)', diff saved to https://phabricator.wikimedia.org/P45847 and previous config saved to /var/cache/conftool/dbconfig/20230314-090649-marostegui.json [09:06:50] (03Merged) 10jenkins-bot: cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/897950 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [09:06:55] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [09:23:19] !log reboot ms-be2040 T331860 [09:23:24] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:25] T331860: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 [09:24:11] PROBLEM - Host ms-be2040 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:28:12] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10elukey) [09:29:05] (03CR) 10Jbond: [C: 03+2] node_regex: add a fixer [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897925 (owner: 10Jbond) [09:29:20] (03PS1) 10Cathal Mooney: Move cloudsw prefix-list filters from templates to YAML [homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) [09:29:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:30:26] (03PS1) 10Jbond: 1.1.2: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/898685 [09:31:14] (03CR) 10Jbond: [C: 03+2] 1.1.2: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/898685 (owner: 10Jbond) [09:31:55] (03PS2) 10Cathal Mooney: Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) [09:32:08] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[09:32:11] (03CR) 10Cathal Mooney: Modify policy to use in aggregate for 185.15.57.0/24 in codfw (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [09:33:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T329260)', diff saved to https://phabricator.wikimedia.org/P45848 and previous config saved to /var/cache/conftool/dbconfig/20230314-093321-marostegui.json [09:33:27] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [09:34:14] Reminder we'll be repooling eqiad RO (so all active/active services, including mw ones) at 11:30UTC [09:35:56] (03PS6) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [09:36:12] !log installing NSS security updates [09:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:17] (03CR) 10Elukey: calico/kubernetes: Replace istio_cni_token with client cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:37:43] (03PS1) 10JMeybohm: cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/898686 (https://phabricator.wikimedia.org/T325292) [09:38:26] (03PS1) 10DCausse: wdqs: export more jmx metrics to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/898687 (https://phabricator.wikimedia.org/T331405) [09:39:22] (03CR) 10JMeybohm: [V: 03+1] calico/kubernetes: Replace istio_cni_token with client cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:39:44] (03PS3) 10Samtar: docroot: Update privacy policy footer link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896216 
(https://phabricator.wikimedia.org/T331680) (owner: 10Varnent) [09:39:51] (03PS3) 10Samtar: [foundationwiki] Grant translation admin rights to 'editor' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896224 (https://phabricator.wikimedia.org/T297396) (owner: 10Varnent) [09:40:18] (03PS3) 10Samtar: Enable VE on more namespaces on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders) [09:40:28] (03CR) 10Elukey: calico/kubernetes: Replace istio_cni_token with client cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:40:32] (03CR) 10Elukey: [C: 03+1] calico/kubernetes: Replace istio_cni_token with client cert [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:40:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, wiki has clear consensus and diffConfig looks correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897997 (https://phabricator.wikimedia.org/T331921) (owner: 10Zabe) [09:40:46] jouncebot: nowandnext [09:40:47] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [09:40:47] In 0 hour(s) and 19 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1000) [09:42:44] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[09:43:19] TheresNoTime: Just in case you didn't see it, I'll lock scap deployments at 10:00UTC in anticipation of eqiad RO repool at 10:30UTC [09:43:29] Corrected reminder we'll be repooling eqiad RO (so all active/active services, including mw ones) at 10:30UTC [09:43:39] ah I saw 11:30 :D [09:43:52] Yeah I messed up my timezones [09:44:00] >< [09:44:02] * TheresNoTime will not deploy [09:44:22] It should be quick-ish, so you can probably deploy right after [09:44:38] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/898686 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [09:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45849 and previous config saved to /var/cache/conftool/dbconfig/20230314-094828-marostegui.json [09:49:53] (03Merged) 10jenkins-bot: cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/898686 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [09:51:02] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:51:17] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10cmooney) >>! In T331886#8688615, @ayounsi wrote: >> One of my concerns is our other caching sites use matched routers for redundancy and we coul... [09:51:34] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:53:43] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Ottomata) Yes, and that's okay. The group Hal should be in then is [[ https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L1009 | analytics-platform-eng-admins ]]... 
[09:53:43] !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:54:40] !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:56:33] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [09:56:42] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) 05Open→03In progress [09:56:43] !log disabling puppet on P:calico::kubernetes for T325268 [09:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:48] T325268: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268 [09:57:03] RECOVERY - Host ms-be2040 is UP: PING OK - Packet loss = 0%, RTA = 33.00 ms [09:58:01] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - 10https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:58:15] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] calico/kubernetes: Replace istio_cni_token with client cert [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1000) [10:00:04] claime: A patch you scheduled for MediaWiki infrastucture (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[10:00:15] !log Locking scap deployment for service switchover - T330651 [10:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:20] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [10:00:43] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10elukey) [10:00:54] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [10:02:35] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [10:02:37] !log Locking scap deployment for service switchover - T331541 [10:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:42] T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 [10:03:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45850 and previous config saved to /var/cache/conftool/dbconfig/20230314-100334-marostegui.json [10:04:59] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10MatthewVernon) @Papaul that's only 13 disks, not 14? The recent activity panel in the iDRAC shows: ` 2023-03-12T15:08:42-0500 Virtual Disk 8 on Integrated RAID Controller... 
[10:14:18] (03PS1) 10Jbond: pki: move services to pki2002 [dns] - 10https://gerrit.wikimedia.org/r/898693 [10:15:45] !log enabling puppet on P:calico::kubernetes for T325268 [10:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:51] T325268: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268 [10:17:39] 10SRE, 10Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10akosiaris) [10:17:43] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris) [10:17:47] 10SRE: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10akosiaris) [10:17:51] 10SRE: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (10akosiaris) [10:18:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T329260)', diff saved to https://phabricator.wikimedia.org/P45851 and previous config saved to /var/cache/conftool/dbconfig/20230314-101840-marostegui.json [10:18:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [10:18:48] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [10:19:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [10:19:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [10:19:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [10:19:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T329260)', diff saved to 
https://phabricator.wikimedia.org/P45852 and previous config saved to /var/cache/conftool/dbconfig/20230314-101918-marostegui.json [10:19:54] 10SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris) [10:20:18] 10SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris) [10:20:21] (03PS5) 10Elukey: api-gateway: allow to configure prefixes without JWT requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) [10:20:22] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris) [10:20:25] (03PS2) 10Elukey: services: allow anon traffic for liftwing's paths in API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/897844 (https://phabricator.wikimedia.org/T331547) [10:20:27] (03PS1) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) [10:20:59] (03PS2) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) [10:21:35] !log move pki.discovery.wmnet to pki2002 (bullseye) [10:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:07] (03CR) 10Kosta Harlan: [C: 03+2] [beta] GrowthExperiments: Short leveling up notification delay for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898246 (https://phabricator.wikimedia.org/T330358) (owner: 10Gergő Tisza) [10:23:11] 10SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris) p:05Triage→03Medium [10:23:39] 10SRE, 10Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10akosiaris) p:05Triage→03Medium 
[10:25:08] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [10:25:39] (03CR) 10Jbond: [C: 03+2] pki: move services to pki2002 [dns] - 10https://gerrit.wikimedia.org/r/898693 (owner: 10Jbond) [10:28:19] !log Running sre.switchdc.mediawiki.00-optional-warmup-caches - T331541 [10:28:23] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches [10:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:25] T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 [10:28:33] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=99) [10:28:38] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches [10:28:56] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet on all recursors [10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:00] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet on all recursors [10:29:15] PROBLEM - Host centrallog1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:29:49] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:51] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) 
[10:30:56] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10SLyngshede-WMF) Attributes we will be needing: - wikimediaGlobalAccountId (MediaWiki SUL account) (optional) - wikimediaGlobalAccountName (MediaWiki SUL account... [10:32:27] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=0) [10:32:51] !log Repooling all active/active services in eqiad - T331541 [10:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:01] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [10:33:13] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 [10:33:19] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 started. 
[10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:35] PROBLEM - DPKG on stat1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:38:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T329260)', diff saved to https://phabricator.wikimedia.org/P45853 and previous config saved to /var/cache/conftool/dbconfig/20230314-103813-marostegui.json [10:38:19] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [10:39:52] (03CR) 10Ilias Sarantopoulos: [C: 03+1] services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey) [10:42:52] !log reimage pki-root1001 [10:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:40] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host pki-root1001.eqiad.wmnet with OS bullseye [10:47:52] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 [10:47:57] T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 [10:47:58] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 comple... 
[10:48:12] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool restbase-async in codfw: T331541 [10:48:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in codfw: T331541 [10:48:59] ACKNOWLEDGEMENT - dump of m2 in eqiad on backupmon1001 is CRITICAL: dump for m2 at eqiad (db1117) taken more than a week ago: Most recent backup 2023-02-28 03:17:30 Marostegui Waiting for the retry https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:49:17] RECOVERY - dump of m2 in eqiad on backupmon1001 is OK: Last dump for m2 at eqiad (db1117) taken on 2023-03-14 03:23:39 (550 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:49:31] (03CR) 10Volans: [C: 03+2] docstrings: automatically document type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 (owner: 10Volans) [10:53:19] (03Merged) 10jenkins-bot: docstrings: automatically document type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 (owner: 10Volans) [10:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45854 and previous config saved to /var/cache/conftool/dbconfig/20230314-105319-marostegui.json [10:53:21] (03PS1) 10Elukey: ml-services: tune autoscaling for ORES model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763) [10:53:53] (03CR) 10Hnowlan: [C: 03+1] "lgtm, minor query" [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey) [10:58:48] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pki-root1001.eqiad.wmnet with reason: host reimage [10:59:20] (03CR) 10Hnowlan: [C: 03+1] "Very neat!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [10:59:31] (03CR) 10Hnowlan: [C: 03+1] services: allow anon traffic for liftwing's paths in API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/897844 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [10:59:33] (03CR) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey) [10:59:37] claime: can you let me know when I'm okay to do a quick deploy? [10:59:59] TheresNoTime: We're experiencing some weirdness in pooling status rn, will tell you when we're ok [11:00:08] ack, good luck! :) [11:00:28] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "These look fine for now!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [11:02:44] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki-root1001.eqiad.wmnet with reason: host reimage [11:02:44] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache api-ro.discovery.wmnet on all recursors [11:02:47] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) api-ro.discovery.wmnet on all recursors [11:02:54] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10MoritzMuehlenhoff) >>! In T331647#8690861, @Ottomata wrote: > Yes, and that's okay. > > The group Hal should be in then is [[ https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/... 
[11:03:55] !log akosiaris@cumin1001 START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors [11:03:58] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors [11:06:55] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: tune autoscaling for ORES model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [11:07:36] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45855 and previous config saved to /var/cache/conftool/dbconfig/20230314-110826-marostegui.json [11:08:37] (03CR) 10Ayounsi: [C: 03+1] Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [11:11:22] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:12:07] (03PS1) 10Urbanecm: arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) [11:12:36] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:12:46] TheresNoTime: once you're given the green light to deploy, would you mind taking ^^ with you? 
[11:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:13:04] (03CR) 10CI reject: [V: 04-1] arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) (owner: 10Urbanecm) [11:13:05] urbanecm: 898700? Sure :) [11:13:08] yup [11:13:13] * urbanecm goes to fix CI in the meantime [11:13:18] (on that patch) [11:13:23] ty [11:13:37] !log We are encountering unexpected DNS anycast issues following T331541, latencies are increased but no production outage. [11:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:42] T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 [11:13:58] (03PS2) 10Urbanecm: arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) [11:16:15] (03PS1) 10Filippo Giunchedi: thanos: add pint for thanos-rule [puppet] - 10https://gerrit.wikimedia.org/r/898701 (https://phabricator.wikimedia.org/T309182) [11:17:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:19:23] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache api-ro.discovery.wmnet on all recursors [11:19:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) api-ro.discovery.wmnet on all recursors [11:20:41] (03CR) 10Elukey: [C: 03+2] api-gateway: allow to configure prefixes without JWT requirements [deployment-charts] - 
10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [11:21:04] (03CR) 10Elukey: [C: 03+2] services: allow anon traffic for liftwing's paths in API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/897844 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [11:21:46] (03CR) 10EoghanGaffney: [C: 03+2] Add blackbox http check for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/897917 (https://phabricator.wikimedia.org/T327978) (owner: 10EoghanGaffney) [11:22:31] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40102/console" [puppet] - 10https://gerrit.wikimedia.org/r/898701 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [11:23:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T329260)', diff saved to https://phabricator.wikimedia.org/P45856 and previous config saved to /var/cache/conftool/dbconfig/20230314-112333-marostegui.json [11:23:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [11:23:39] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [11:23:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [11:23:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T329260)', diff saved to https://phabricator.wikimedia.org/P45857 and previous config saved to /var/cache/conftool/dbconfig/20230314-112354-marostegui.json [11:27:24] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: sync [11:27:47] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [11:29:18] (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated 
globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:34:18] (MediaWikiMemcachedHighErrorRate) resolved: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:35:18] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:38:09] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [11:38:17] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [11:38:30] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [11:39:07] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [11:39:34] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:40:58] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:41:28] RECOVERY - restbase endpoints health on 
restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:41:42] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [11:42:01] 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert) [11:42:02] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [11:42:19] 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert) p:05Triage→03High [11:42:21] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [11:43:58] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:46:21] (03PS3) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) [11:46:35] (03CR) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey) [11:49:08] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Ottomata) Hm, that group (as well as analytics-research-admins) gives some sudo rights to a system user (analytics-platform-eng) that does have analytics-privatedata-users access, so I think it does require... 
[11:49:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T329260)', diff saved to https://phabricator.wikimedia.org/P45860 and previous config saved to /var/cache/conftool/dbconfig/20230314-114957-marostegui.json [11:50:03] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [11:51:35] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool appservers-ro in eqiad: T331541 [11:51:40] T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 [11:51:58] !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache appservers-ro.discovery.wmnet on all recursors [11:52:01] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) appservers-ro.discovery.wmnet on all recursors [11:52:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool appservers-ro in eqiad: T331541 [11:58:48] (03CR) 10Hnowlan: [C: 03+1] services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey) [12:01:07] (03CR) 10Elukey: [C: 03+2] services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey) [12:03:42] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: sync [12:03:55] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [12:05:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45861 and previous config saved to /var/cache/conftool/dbconfig/20230314-120503-marostegui.json [12:05:14] 10ops-eqiad, 10cloud-services-team (Hardware): cloudcontrol1007: power supply temperature critical - 
https://phabricator.wikimedia.org/T331984 (10aborrero) [12:05:41] 10ops-eqiad, 10cloud-services-team (Hardware): cloudcontrol1007: power supply temperature critical - https://phabricator.wikimedia.org/T331984 (10aborrero) p:05Triage→03Medium [12:06:53] !log Unlocked scap deployments - T331541 [12:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:58] T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 [12:07:03] TheresNoTime, urbanecm [12:07:12] claime: thank you :) [12:07:15] ty [12:07:17] Go ahead, we're still having issues, but nothing that warrants blocking scap deployments [12:08:14] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route pool appservers-ro in eqiad: T331541 [12:08:15] !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache appservers-ro.discovery.wmnet on all recursors [12:08:19] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) appservers-ro.discovery.wmnet on all recursors [12:08:32] 10SRE, 10Scap, 10serviceops-collab, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10eoghan) [12:08:42] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [12:08:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896224 (https://phabricator.wikimedia.org/T297396) (owner: 10Varnent) [12:09:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896216 (https://phabricator.wikimedia.org/T331680) (owner: 10Varnent) [12:09:28] (03Merged) 10jenkins-bot: docroot: Update privacy policy footer link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896216 
(https://phabricator.wikimedia.org/T331680) (owner: 10Varnent) [12:09:31] (03Merged) 10jenkins-bot: [foundationwiki] Grant translation admin rights to 'editor' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896224 (https://phabricator.wikimedia.org/T297396) (owner: 10Varnent) [12:10:03] (03PS1) 10Jbond: recursour: only forward to the local ns server [puppet] - 10https://gerrit.wikimedia.org/r/898704 [12:11:29] !log samtar@deploy2002 Started scap: Backport for [[gerrit:896224|[foundationwiki] Grant translation admin rights to 'editor' group (T297396)]], [[gerrit:896216|docroot: Update privacy policy footer link (T331680)]] [12:11:35] T297396: Expand Governance Wiki Editor user group rights to include translate admin rights - https://phabricator.wikimedia.org/T297396 [12:11:35] T331680: Update footer links - https://phabricator.wikimedia.org/T331680 [12:13:12] !log samtar@deploy2002 samtar and varnent: Backport for [[gerrit:896224|[foundationwiki] Grant translation admin rights to 'editor' group (T297396)]], [[gerrit:896216|docroot: Update privacy policy footer link (T331680)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [12:13:14] (testing) [12:13:18] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool appservers-ro in eqiad: T331541 [12:13:23] T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 [12:13:53] (syncing) [12:13:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:14:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10RobH) Since this is out of 
warranty, the pending purchase of 5 disks was raised to 7 on T331988 to accommodate this repair. [12:15:30] 10SRE, 10serviceops: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10LSobanski) [12:15:42] I have a failure in scap [12:15:55] `Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.` - https://phabricator.wikimedia.org/P45862 [12:16:17] (scap is rolling back) [12:16:40] 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert) [12:18:00] (03CR) 10Kosta Harlan: "Is there anything that needs to happen to deploy this change? Does that happen automatically?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [12:18:54] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host pki-root1001.eqiad.wmnet with OS bullseye [12:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:19:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:20:05] !log `Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.` (P45862) during scap deployment of T297396 + T331680 — scap rolled back [12:20:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45863 and previous config saved to 
/var/cache/conftool/dbconfig/20230314-122009-marostegui.json [12:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:11] T297396: Expand Governance Wiki Editor user group rights to include translate admin rights - https://phabricator.wikimedia.org/T297396 [12:20:11] T331680: Update footer links - https://phabricator.wikimedia.org/T331680 [12:20:41] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:896224|[foundationwiki] Grant translation admin rights to 'editor' group (T297396)]], [[gerrit:896216|docroot: Update privacy policy footer link (T331680)]] (duration: 09m 12s) [12:21:23] ... okay but those changes were actually sync'd.. [12:21:37] 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Marostegui) I just wanted to mention that despite of the sudden spike on DB reads, our databases kept up just fine in general. We did have timeouts on some enwiki (s1) replicas... 
[12:21:45] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [12:23:47] !log installing git security updates [12:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:13] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:24:19] (03PS4) 10Samtar: Enable VE on more namespaces on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders) [12:24:22] (03PS3) 10Samtar: arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) (owner: 10Urbanecm) [12:27:32] (03PS8) 10Slyngshede: P:openldap Extend wmf-user schema with global account. [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) [12:27:55] 10SRE, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10Johan) [12:28:10] 10SRE, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10Johan) [12:29:15] (03PS1) 10Hnowlan: thumbor: bump workers, reduce cpu, increase haproxy queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/898728 (https://phabricator.wikimedia.org/T328033) [12:31:42] (03PS9) 10Slyngshede: P:openldap Extend wmf-user schema with global account. 
[puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) [12:35:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T329260)', diff saved to https://phabricator.wikimedia.org/P45864 and previous config saved to /var/cache/conftool/dbconfig/20230314-123515-marostegui.json [12:35:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:35:21] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [12:35:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:36:47] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) [12:37:45] (03CR) 10Arturo Borrero Gonzalez: "To configure BIRD we need to know which IP address we will be using." 
[puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:39:39] (03PS41) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [12:39:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:41:45] (03PS1) 10Hokwelum: Add new mirror details to list of dump mirrors [puppet] - 10https://gerrit.wikimedia.org/r/898729 [12:41:59] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: give them proper role [puppet] - 10https://gerrit.wikimedia.org/r/898730 (https://phabricator.wikimedia.org/T324992) [12:43:14] (03PS2) 10Hokwelum: Add new mirror details to list of dump mirrors [puppet] - 10https://gerrit.wikimedia.org/r/898729 [12:43:20] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40104/console" [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:43:43] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [12:44:31] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2003-dev.codfw.wmnet with OS bullseye [12:44:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:45:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: give them proper role [puppet] - 
10https://gerrit.wikimedia.org/r/898730 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:45:41] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: give them proper role [puppet] - 10https://gerrit.wikimedia.org/r/898730 (https://phabricator.wikimedia.org/T324992) [12:48:38] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:53:19] (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:53:43] (03CR) 10ArielGlenn: [C: 03+2] Add new mirror details to list of dump mirrors [puppet] - 10https://gerrit.wikimedia.org/r/898729 (owner: 10Hokwelum) [12:54:03] (03CR) 10Nicolas Fraison: [C: 03+1] Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:54:57] (03PS1) 10Clément Goubert: sre.discovery.service-route: Reduce TTL before changes [cookbooks] - 10https://gerrit.wikimedia.org/r/898731 [12:54:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:55:17] (03CR) 10Ayounsi: "Some comments, then indeed next step is to define a VIP pool." 
[puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:55:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [12:56:08] (03CR) 10Nicolas Fraison: [C: 04-1] Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:57:34] (03CR) 10Btullis: [V: 03+1] Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:58:12] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage [12:58:33] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage [12:59:22] (03PS42) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [12:59:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1300). [13:00:05] TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1300) [13:00:05] xSavitar and raynor: A patch you scheduled for Mobileapps/RESTBase/Wikifeeds is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:25] * Lucas_WMDE is mostly afk [13:00:25] I can (self-)deploy! [13:00:53] (03CR) 10Btullis: Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [13:01:17] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:01:27] s'up [13:01:43] (I am holding deployment per ^) [13:02:09] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage [13:04:00] can someone ack the page? my splunk app is not cooperating... 
[13:04:07] On it [13:04:17] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage [13:04:35] Acked [13:05:04] XioNoX: you can use sirenbot [13:05:06] sobanski: you are fast [13:05:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [13:05:18] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [13:06:17] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:07:30] XioNoX: parsoid is having a worker starvation issue, but it doesn't seem related to an increase in queries [13:08:12] * urandom is slow [13:08:22] fyi, the link on that page returns a "Panel not found" [13:08:47] XioNoX: It works for me, at least the grafana link [13:08:48] The Grafana one? It worked for me [13:09:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders) [13:09:03] uh? [13:09:16] which panel does it link to? 
[13:09:27] I only get to the dashboard [13:09:31] Ah, there is a short-lived warning at the top of the page [13:09:34] It links to the dashboard [13:09:47] (03Merged) 10jenkins-bot: Enable VE on more namespaces on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders) [13:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:10:10] !log samtar@deploy2002 Started scap: Backport for [[gerrit:894094|Enable VE on more namespaces on foundationwiki (T331079)]] [13:10:11] But it should link to https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=1678788598368&orgId=1&to=1678799398368&var-cluster=parsoid&var-datasource=codfw+prometheus%2Fops&viewPanel=64 [13:10:12] panelid=54 [13:10:16] T331079: Enable VisualEditor on all main namespaces on foundation.wikimedia.org - https://phabricator.wikimedia.org/T331079 [13:10:31] Yeah, sobanski, panelid is off by 10 [13:11:45] !log samtar@deploy2002 esanders and samtar: Backport for [[gerrit:894094|Enable VE on more namespaces on foundationwiki (T331079)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:11:56] (testing) [13:12:18] (syncing) [13:13:03] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:26] I think the grafana url is borked completely [13:14:39] It at least links to the right dash, but not to the fullscreen panel [13:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes 
- https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:15:26] (03PS1) 10BBlack: recdns: add a permissive ecs-add-for for new pdns [puppet] - 10https://gerrit.wikimedia.org/r/898736 [13:15:45] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10Papaul) @MatthewVernon you right i didn't read disk 6.I will see if i can find disks from old ms-be* [13:16:47] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/898736 (owner: 10BBlack) [13:17:13] (03CR) 10BBlack: [C: 03+2] recdns: add a permissive ecs-add-for for new pdns [puppet] - 10https://gerrit.wikimedia.org/r/898736 (owner: 10BBlack) [13:17:32] (03CR) 10Jbond: [C: 03+1] "could be more restrictive but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/898736 (owner: 10BBlack) [13:18:05] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:894094|Enable VE on more namespaces on foundationwiki (T331079)]] (duration: 07m 55s) [13:18:11] T331079: Enable VisualEditor on all main namespaces on foundation.wikimedia.org - https://phabricator.wikimedia.org/T331079 [13:18:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) (owner: 10Urbanecm) [13:18:37] !log rolling out recdns fixup for missing 10/8 ECS affecting local inter-dc discovery/geoip results [13:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:01] (03Merged) 10jenkins-bot: arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) (owner: 10Urbanecm) [13:19:22] !log samtar@deploy2002 Started scap: Backport for [[gerrit:898700|arwiki: Add new throttle rule (T331973)]] [13:19:27] T331973: Temporary lift 
IP cap for Wiki workshop at Birzeit University on 15-18 March 2023 - https://phabricator.wikimedia.org/T331973 [13:20:55] !log samtar@deploy2002 samtar and urbanecm: Backport for [[gerrit:898700|arwiki: Add new throttle rule (T331973)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:20:59] (syncing) [13:21:27] (03PS1) 10Clément Goubert: team-sre/mediawiki: Fix php-fpm alert dashboard links [alerts] - 10https://gerrit.wikimedia.org/r/898737 [13:22:35] XioNoX: sobanski ^ [13:23:25] PROBLEM - Recursive DNS on 2620:0:861:1:208:80:154:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:24:31] PROBLEM - Recursive DNS on 208.80.154.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:24:42] er [13:24:43] ^ that's unsettling! [13:24:46] what's this about [13:24:49] looking [13:24:58] but it could be a false alert, maybe the check depends on wrong ECS behavior or whatever [13:24:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:25:01] PROBLEM - Recursive DNS on 208.80.155.108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:25:09] PROBLEM - Bird Internet Routing Daemon on dns1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:25:13] stopped the rollout though, in case [13:25:22] thanks looking [13:25:41] (03CR) 10LSobanski: [C: 03+1] team-sre/mediawiki: Fix php-fpm alert dashboard links [alerts] - 10https://gerrit.wikimedia.org/r/898737 (owner: 10Clément Goubert) [13:25:43] PROBLEM - Recursive DNS on 
2620:0:861:4:208:80:155:108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:25:49] Mar 14 13:25:42 dns1002 pdns-recursor[3751219]: Mar 14 13:25:42 Exception: Trying to set unknown setting 'ecs-add-for: 0.0.0.0/0, ::/> [13:26:02] pdns-rec fails, hence anycast-hc fails, hence bird fails [13:26:10] * claime curses dns [13:26:15] that's how it should work [13:26:27] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:26:29] (03CR) 10Clément Goubert: [C: 03+2] team-sre/mediawiki: Fix php-fpm alert dashboard links [alerts] - 10https://gerrit.wikimedia.org/r/898737 (owner: 10Clément Goubert) [13:26:29] it's failing over service to another DC at this point, since we hit all 3x eqiad now [13:26:33] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns1003 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:26:45] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762) (owner: 10Anzx) [13:26:47] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:898700|arwiki: Add new throttle rule (T331973)]] (duration: 07m 24s) [13:26:53] (03PS1) 10BBlack: Revert "recdns: add a permissive ecs-add-for for new pdns" [puppet] - 10https://gerrit.wikimedia.org/r/898720 [13:26:54] T331973: Temporary lift IP cap for Wiki workshop at Birzeit University on 15-18 March 2023 - https://phabricator.wikimedia.org/T331973 [13:27:06] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "recdns: add a permissive ecs-add-for 
for new pdns" [puppet] - 10https://gerrit.wikimedia.org/r/898720 (owner: 10BBlack) [13:27:09] PROBLEM - Bird Internet Routing Daemon on dns1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:27:15] !log close UTC afternoon backport window [13:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:27] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns2001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:27:28] (03CR) 10CI reject: [V: 04-1] Remove FlaggedRevs for ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762) (owner: 10Anzx) [13:27:35] PROBLEM - Recursive DNS on 208.80.154.134 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:27:35] PROBLEM - Bird Internet Routing Daemon on dns1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:27:42] (03Merged) 10jenkins-bot: team-sre/mediawiki: Fix php-fpm alert dashboard links [alerts] - 10https://gerrit.wikimedia.org/r/898737 (owner: 10Clément Goubert) [13:27:43] PROBLEM - Recursive DNS on 2620:0:861:2:208:80:154:134 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:27:45] (03PS4) 10Zoranzoki21: Remove meaningless restriction level "none" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 (owner: 10Thiemo Kreuz (WMDE)) [13:27:49] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:27:49] PROBLEM - Recursive DNS on 
2620:0:860:3:208:80:153:77 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:27:59] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:28:03] PROBLEM - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:28:25] PROBLEM - Recursive DNS on 208.80.153.77 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:28:29] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:28:31] RECOVERY - Recursive DNS on 2620:0:861:1:208:80:154:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:28:43] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:28:49] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:29:15] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns2001 is OK: OK: UP (pid=2866885) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:29:15] RECOVERY - Recursive DNS on 2620:0:861:4:208:80:155:108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:29:21] RECOVERY - Recursive DNS on 208.80.154.134 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:29:25] RECOVERY - Bird Internet Routing Daemon on dns1003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:29:45] (JobUnavailable) firing: (2) Reduced 
availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:29:48] (03PS1) 10Ssingh: recdns: add a permissive ecs-add-for for new pdns [puppet] - 10https://gerrit.wikimedia.org/r/898738 [13:30:10] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:30:18] RECOVERY - Bird Internet Routing Daemon on dns1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:30:37] (03CR) 10BBlack: [C: 03+1] "Syntax looks correct-er! :)" [puppet] - 10https://gerrit.wikimedia.org/r/898738 (owner: 10Ssingh) [13:30:53] (03CR) 10Ssingh: [C: 03+2] recdns: add a permissive ecs-add-for for new pdns [puppet] - 10https://gerrit.wikimedia.org/r/898738 (owner: 10Ssingh) [13:31:18] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 182, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:34] (03PS4) 10Anzx: Remove FlaggedRevs for ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762) [13:32:52] RECOVERY - Bird Internet Routing Daemon on dns1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:33:20] !log rolling out recdns fixup for missing 10/8 ECS affecting local inter-dc discovery/geoip results (again, with sukhe's more-correct variant!) [13:33:21] (03CR) 10Btullis: "It looks fine in general. Two queries inline and once again I'd recommend getting Janis' opinion. 
The -1 is for the networkpolicy and the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [13:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:32] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10ayounsi) > Unless we want to replace both at that stage? Probably not > Ideally, longer-term, it would be nice to have both racks fairly symmet... [13:33:40] (03CR) 10Btullis: [C: 04-1] spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [13:34:22] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns1002 is OK: OK: UP (pid=3758841) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:34:25] (03Abandoned) 10Jbond: recursour: only forward to the local ns server [puppet] - 10https://gerrit.wikimedia.org/r/898704 (owner: 10Jbond) [13:34:35] (03CR) 10Samtar: "Did you use `composer manage-dblist` to remove `ptwikisource` from the `flaggedrevs` dblist?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762) (owner: 10Anzx) [13:34:45] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:57] 10SRE, 10Infrastructure-Foundations, 10netops: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10ayounsi) 05Open→03Declined Closing this task. I'll reopen if there is anything useful that comes out of the conversation. 
Nothing too interesting for us in pmacct changelog n... [13:35:34] (03CR) 10Cathal Mooney: [C: 03+2] Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [13:37:26] (03Merged) 10jenkins-bot: Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [13:38:32] (03PS1) 10Elukey: services: add staging config for Lift Wing to the API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898741 [13:38:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:41:06] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:42:26] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:42] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns1003 is OK: OK: UP (pid=120398) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:43:58] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:47:07] (03CR) 10Ayounsi: [C: 03+1] "One small comment and LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [13:48:04] RECOVERY - Recursive DNS on 208.80.153.77 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:49:38] RECOVERY - Recursive DNS on 208.80.154.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:49:48] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:51:11] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:20] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:52:23] (03CR) 10Andrew Bogott: [C: 03+2] paws/NFS: move paws to a project-local NFS server [puppet] - 10https://gerrit.wikimedia.org/r/896353 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [13:52:44] RECOVERY - Recursive DNS on 208.80.155.108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:52:58] RECOVERY - Bird Internet Routing Daemon on dns2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:54:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:55:48] RECOVERY - Recursive DNS on 2620:0:860:3:208:80:153:77 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:58:16] !log 
jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [13:58:20] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... [13:58:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [13:58:48] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [13:58:51] (03PS1) 10Filippo Giunchedi: netops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898746 (https://phabricator.wikimedia.org/T309182) [13:58:53] (03PS1) 10Filippo Giunchedi: structured-data: address warnings [alerts] - 10https://gerrit.wikimedia.org/r/898747 [13:59:45] (JobUnavailable) resolved: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:59:57] RECOVERY - Recursive DNS on 2620:0:861:2:208:80:154:134 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:00:35] (03PS1) 10Samtar: InitialiseSettings.php: Undeploy Phonos from afwiktionary, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898749 (https://phabricator.wikimedia.org/T332006) [14:00:48] !log reimage pki1001 [14:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:33] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host pki1001.eqiad.wmnet with OS bullseye [14:01:35] 
(03PS10) 10Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [14:02:27] (03CR) 10Filippo Giunchedi: "test_alerts.py::test_lint_rule[team-structured-data/data_pipelines.yaml]" [alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi) [14:02:35] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/898750 [14:05:16] (03CR) 10David Caro: [C: 03+2] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:05:20] (03CR) 10Ayounsi: [C: 03+1] netops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898746 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [14:05:23] (03CR) 10David Caro: [C: 03+2] maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [14:07:29] (03PS1) 10Filippo Giunchedi: dcops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898753 (https://phabricator.wikimedia.org/T309182) [14:07:31] (03PS1) 10Filippo Giunchedi: perf: deploy to 'ext' instance [alerts] - 10https://gerrit.wikimedia.org/r/898754 (https://phabricator.wikimedia.org/T309182) [14:08:01] (03CR) 10Cparle: "This looks fine to me Filippo, just wondering if there's any way I can test it to make sure I get a warning when I ought to?" 
[alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi) [14:08:18] (03CR) 10Filippo Giunchedi: [C: 03+2] netops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898746 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [14:08:46] (03CR) 10Ayounsi: Refactor and centralize BGPpeer config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [14:09:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Decrease db2122 weight', diff saved to https://phabricator.wikimedia.org/P45866 and previous config saved to /var/cache/conftool/dbconfig/20230314-140926-root.json [14:09:27] 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert) [14:09:32] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert) [14:09:41] (03CR) 10Filippo Giunchedi: structured-data: address warnings (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi) [14:09:50] (03PS1) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755 [14:12:15] (JobUnavailable) firing: (2) Reduced availability for job cfssl in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:12:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:05] 10SRE, 
10Data-Persistence, 10serviceops: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert) [14:13:16] 10SRE, 10Data-Persistence, 10serviceops: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert) [14:13:22] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert) [14:13:37] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert) [14:14:01] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert) [14:14:12] (03CR) 10Cparle: [C: 03+2] structured-data: address warnings [alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi) [14:14:26] (03PS2) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755 [14:14:28] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert) p:05Triage→03Medium [14:14:50] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert) [14:14:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:11] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:15:37] (03Merged) 10jenkins-bot: structured-data: address warnings [alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi) [14:15:42] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert) p:05High→03Medium [14:15:54] (03CR) 10Filippo Giunchedi: [C: 03+2] dcops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898753 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [14:16:06] 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10akosiaris) [14:16:19] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pki1001.eqiad.wmnet with reason: host reimage [14:16:29] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris) [14:16:32] 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10akosiaris) [14:16:41] !log All active/active services in eqiad repooled, DNS issues resolved - T331541 [14:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:46] T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 [14:17:17] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [14:17:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.491 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:17:43] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:17:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:18:48] 10SRE, 10serviceops: Migrate dragonfly-supernodes to bullseye - https://phabricator.wikimedia.org/T332011 (10akosiaris) [14:19:03] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki1001.eqiad.wmnet with reason: host reimage [14:19:06] 10SRE, 10serviceops: Migrate dragonfly-supernodes to bullseye - https://phabricator.wikimedia.org/T332011 (10akosiaris) [14:19:13] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris) [14:19:48] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [14:20:33] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) 05In progress→03Resolved We ran into a powerdns configuration issue which meant that instead of traffic being spread over both datacenters, we completely switched... 
[14:21:11] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris) [14:21:43] 10SRE, 10serviceops: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris) [14:21:49] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris) [14:21:55] (03PS2) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/898750 [14:22:01] 10SRE, 10serviceops: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris) [14:22:07] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris) [14:23:30] (03PS1) 10JMeybohm: cert-manager: Allow cfssl-issuer access to pki2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/898759 (https://phabricator.wikimedia.org/T331696) [14:23:51] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris) [14:24:09] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:25:28] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris) [14:27:38] (03PS3) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/898750 [14:30:50] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Allow cfssl-issuer access to pki2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/898759 (https://phabricator.wikimedia.org/T331696) (owner: 10JMeybohm) [14:31:17] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 
10https://gerrit.wikimedia.org/r/898750 (owner: 10Muehlenhoff) [14:31:35] (03PS1) 10Filippo Giunchedi: search-platform: deploy alerts to specific Prometheus instances [alerts] - 10https://gerrit.wikimedia.org/r/898765 (https://phabricator.wikimedia.org/T309182) [14:32:15] (JobUnavailable) resolved: Reduced availability for job cfssl in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:32:46] (03PS1) 10Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/898766 [14:35:54] (03Merged) 10jenkins-bot: cert-manager: Allow cfssl-issuer access to pki2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/898759 (https://phabricator.wikimedia.org/T331696) (owner: 10JMeybohm) [14:37:17] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki1001.eqiad.wmnet with OS bullseye [14:37:20] (03PS1) 10Alexandros Kosiaris: sre.discovery.service-route: Use DNS_TTL_SHORT [cookbooks] - 10https://gerrit.wikimedia.org/r/898767 [14:37:21] !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:37:30] !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:37:37] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:37:45] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:37:53] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:38:02] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:38:05] (03PS2) 10Filippo Giunchedi: search-platform: deploy alerts to specific Prometheus instances [alerts] - 10https://gerrit.wikimedia.org/r/898765 (https://phabricator.wikimedia.org/T309182) [14:38:31] (03PS2) 10Alexandros Kosiaris: sre.discovery.service-route: Use DNS_TTL_SHORT [cookbooks] - 10https://gerrit.wikimedia.org/r/898767 [14:40:22] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/898766 (owner: 10Jgiannelos) [14:41:08] (03PS1) 10Filippo Giunchedi: structured-data: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898768 (https://phabricator.wikimedia.org/T309182) [14:42:09] !log jbond@cumin1001 START - Cookbook sre.puppet.renew-cert for pki1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin1001 [14:43:28] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for pki1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin1001 [14:43:36] (03CR) 10Cparle: [C: 03+2] structured-data: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898768 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [14:43:52] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate PKI servers to Bullseye - https://phabricator.wikimedia.org/T331696 (10jbond) [14:44:51] (03Merged) 10jenkins-bot: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/898766 (owner: 10Jgiannelos) [14:44:53] (03PS1) 10JMeybohm: cert-manager: Run cert-manager 1.10.1 in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/898770 (https://phabricator.wikimedia.org/T325292) [14:45:41] 10SRE, 10Infrastructure-Foundations: pki2001: decomission server - https://phabricator.wikimedia.org/T332018 (10jbond) [14:47:58] (03CR) 10Volans: "duplicate of Ia3917b61798b2b4e6fb0ff3676f19658f9565c72 ?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/898767 (owner: 10Alexandros Kosiaris) [14:50:07] (03CR) 10Volans: sre.discovery.service-route: Reduce TTL before changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/898731 (owner: 10Clément Goubert) [14:50:17] 10SRE, 10Infrastructure-Foundations: pki2001: decommission server - https://phabricator.wikimedia.org/T332018 (10Aklapper) [14:51:21] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Run cert-manager 1.10.1 in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/898770 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [14:51:45] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:52:21] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:52:25] (03CR) 10Clément Goubert: sre.discovery.service-route: Reduce TTL before changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/898731 (owner: 10Clément Goubert) [14:52:33] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:53:43] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:53:48] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:53:54] (03PS1) 10FNegri: [tbs.harbor] Fix wrong paths for Harbor certs [puppet] - 10https://gerrit.wikimedia.org/r/898773 (https://phabricator.wikimedia.org/T316323) [14:54:53] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:55:51] (03CR) 10CI reject: [V: 04-1] [tbs.harbor] Fix wrong paths for Harbor certs [puppet] - 10https://gerrit.wikimedia.org/r/898773 (https://phabricator.wikimedia.org/T316323) (owner: 10FNegri) [14:56:25] (03Merged) 10jenkins-bot: cert-manager: Run cert-manager 1.10.1 in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/898770 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [14:56:33] (03CR) 
10Btullis: [C: 03+2] Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [14:58:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [14:58:15] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... [14:58:32] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/898731 (owner: 10Clément Goubert) [14:59:37] !log jayme@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:00:12] !log jayme@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:00:16] (03CR) 10Clément Goubert: sre.discovery.service-route: Reduce TTL before changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/898731 (owner: 10Clément Goubert) [15:01:06] (03Abandoned) 10Alexandros Kosiaris: sre.discovery.service-route: Use DNS_TTL_SHORT [cookbooks] - 10https://gerrit.wikimedia.org/r/898767 (owner: 10Alexandros Kosiaris) [15:02:43] PROBLEM - haproxy process on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [15:02:55] PROBLEM - Check systemd state on cloudlb2002-dev is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:59] PROBLEM - haproxy alive on cloudlb2002-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [15:04:41] PROBLEM - haproxy alive on cloudlb2003-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy 
[15:05:03] (03PS6) 10Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) [15:05:17] PROBLEM - haproxy process on cloudlb2003-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [15:05:21] PROBLEM - Check systemd state on cloudlb2003-dev is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:40] (03PS1) 10Filippo Giunchedi: netops: split routinator from ping offload [alerts] - 10https://gerrit.wikimedia.org/r/898776 (https://phabricator.wikimedia.org/T309182) [15:07:56] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [15:08:22] (03PS1) 10JHathaway: kernel-purge: purge all at once and check return code [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) [15:09:25] (03PS7) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [15:09:33] (03PS7) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) [15:09:48] (03CR) 10CI reject: [V: 04-1] osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:10:20] (03PS2) 10Filippo Giunchedi: structured-data: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898768 (https://phabricator.wikimedia.org/T309182) [15:10:22] (03CR) 10Filippo Giunchedi: [V: 03+2] structured-data: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898768 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [15:13:19] PROBLEM - SSH on stat1006 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:16:00] (03PS2) 10FNegri: [tbs.harbor] Fix wrong paths for Harbor certs [puppet] - 10https://gerrit.wikimedia.org/r/898773 (https://phabricator.wikimedia.org/T316323) [15:16:41] (03CR) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [15:17:58] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [15:18:02] (03PS7) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [15:19:06] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging2003.codfw.wmnet with OS bullseye [15:19:25] (03PS8) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [15:19:32] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10Papaul) @MatthewVernon i repalced 6. let me which other one is having issues ` Physical Disk 0:1:0 Online 0 3725.50 GB Not Capable SATA HDD No Not Applicable Phy... 
[15:19:38] (03Abandoned) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:19:48] (03CR) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:19:53] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:20:53] (03PS9) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [15:21:14] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:21:59] 10SRE, 10serviceops: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10MoritzMuehlenhoff) The backports are complete and support Unicode 13 now! ` jmm@jmm-mw-icu67:~$ php -r "var_dump(IntlChar::getUnicodeVersion());" array(4) { [0]=> int(13) [1]=> int(0) [2]=> int(0)... 
[15:23:48] (03PS10) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [15:23:52] (03PS2) 10JHathaway: kernel-purge: purge all at once and check return code [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) [15:24:33] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:25:20] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40109/console" [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [15:28:34] 10SRE, 10Infrastructure-Foundations, 10decommission-hardware: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10jbond) p:05Triage→03Medium a:03jbond [15:30:02] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10herron) [15:30:15] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pki2001.codfw.wmnet with reason: decommission [15:30:41] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pki2001.codfw.wmnet with reason: decommission [15:30:50] 10SRE, 10Infrastructure-Foundations, 10decommission-hardware: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4697f9e6-30a6-446d-b67d-d99317a73ab5) set by jbond@cumin1001 for 5 days, 0:00:00 on 1 host(s) and thei... 
[15:31:03] 10SRE, 10Infrastructure-Foundations, 10decommission-hardware: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10jbond) [15:32:56] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2003.codfw.wmnet with reason: host reimage [15:35:37] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [15:35:39] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [15:36:15] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2003.codfw.wmnet with reason: host reimage [15:37:26] (03PS1) 10Filippo Giunchedi: structured-data: deploy to ops/eqiad only [alerts] - 10https://gerrit.wikimedia.org/r/898783 (https://phabricator.wikimedia.org/T309182) [15:38:00] (03PS1) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) [15:38:54] (03PS11) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [15:39:54] (03CR) 10CI reject: [V: 04-1] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:40:41] (03PS1) 10Jbond: pki: move to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/898785 (https://phabricator.wikimedia.org/T332018) [15:40:58] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:42:12] (03CR) 10Jbond: [C: 03+2] pki: move to spare::system [puppet] - 
10https://gerrit.wikimedia.org/r/898785 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond) [15:42:39] (03PS2) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) [15:43:25] (03PS3) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) [15:46:10] (03PS12) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [15:46:26] (03PS13) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [15:48:52] (03CR) 10Andrew Bogott: [C: 03+1] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:49:12] (03CR) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:50:41] (03CR) 10JHathaway: [V: 03+1 C: 03+2] kernel-purge: purge all at once and check return code [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [15:51:25] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10Papaul) @MatthewVernon hey i am about to go pickup the disk i need for you to ping me on dc-ops channel so we can coordinate the replacing of both disks. 
Thanks [15:52:40] (03CR) 10Muehlenhoff: pki: move to spare::system (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898785 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond) [15:53:47] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn) [15:53:49] (03PS1) 10Jbond: site.pp: move pki2001 to insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/898790 (https://phabricator.wikimedia.org/T332033) [15:53:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:35] (03CR) 10AikoChou: [C: 03+1] ml-services: tune autoscaling for ORES model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [15:54:55] (03CR) 10Jbond: [C: 03+2] pki: move to spare::system (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898785 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond) [15:55:35] (03CR) 10Jbond: [C: 03+2] site.pp: move pki2001 to insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/898790 (https://phabricator.wikimedia.org/T332033) (owner: 10Jbond) [15:56:10] (03CR) 10Jbond: [C: 03+2] "and probably didn't need this one either 😊" [puppet] - 10https://gerrit.wikimedia.org/r/898790 (https://phabricator.wikimedia.org/T332033) (owner: 10Jbond) [15:58:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:28] !log herron@cumin1001 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2003.codfw.wmnet with OS bullseye [15:59:31] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:00:02] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts pki2001.codfw.wmnet [16:00:04] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:03:15] (03PS1) 10Jelto: install_server: use second pair of disks for /srv/gitlab-backup [puppet] - 10https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) [16:03:51] (03PS1) 10Jbond: pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018) [16:04:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:04:04] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 12:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Bootstrapping ceph [16:04:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 12:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Bootstrapping ceph [16:04:55] (03CR) 10JMeybohm: [C: 03+1] pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond) [16:06:02] (03CR) 10Filippo Giunchedi: [C: 03+2] structured-data: deploy to ops/eqiad only [alerts] - 10https://gerrit.wikimedia.org/r/898783 
(https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [16:06:04] (03PS1) 10Herron: profile::kafka::broker::monitoring: remove under replicated icinga check [puppet] - 10https://gerrit.wikimedia.org/r/898793 (https://phabricator.wikimedia.org/T309010) [16:06:28] (03CR) 10Jbond: [C: 03+2] pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond) [16:06:35] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10bd808) >>! In T320794#8690975, @SLyngshede-WMF wrote: > Attributes we will be needing: > > - wikimediaGlobalAccountName (MediaWiki SUL account name) (optional) Be aw... [16:07:30] (03CR) 10Jbond: [C: 03+2] pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond) [16:08:27] (03PS2) 10Herron: profile::kafka::broker::monitoring: remove under replicated icinga check [puppet] - 10https://gerrit.wikimedia.org/r/898793 (https://phabricator.wikimedia.org/T309010) [16:09:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:10:15] (03CR) 10Jelto: "Hi Moritz. Can you check the partman config? I want to move /srv/gitlab-backup to two new disks (raid 1). 
I removed the volume from lvm an" [puppet] - 10https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [16:10:20] (03PS3) 10Herron: profile::kafka::broker::monitoring: remove under replicated icinga check [puppet] - 10https://gerrit.wikimedia.org/r/898793 (https://phabricator.wikimedia.org/T309010) [16:10:45] (JobUnavailable) firing: Reduced availability for job cfssl in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:50] (03Merged) 10jenkins-bot: pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond) [16:10:58] (03PS1) 10Daniel Kinzler: Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) [16:11:37] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [16:11:40] (03CR) 10CI reject: [V: 04-1] Always write parsoid output to parser cache. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [16:13:24] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pki2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [16:15:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10KSarabia-WMF) [16:16:17] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pki2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [16:16:17] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:16:17] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pki2001.codfw.wmnet [16:16:26] 10SRE, 10Infrastructure-Foundations, 10decommission-hardware, 10Patch-For-Review: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `pki2001.codfw.wmnet` - pki2001.codfw.wmnet (**WARN**... [16:19:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10KSarabia-WMF) [16:19:58] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10Papaul) 05Open→03Resolved Both 21 and 23 replaced ` Physical Disk 0:2:21 Online 21 7451.5 GB SATA HDD No Physical Disk 0:2:22 Online 22 7451.5 GB SATA... 
[16:20:06] (03CR) 10Dzahn: [C: 03+1] install_server: use second pair of disks for /srv/gitlab-backup [puppet] - 10https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [16:20:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10KSarabia-WMF) [16:20:45] (JobUnavailable) resolved: Reduced availability for job cfssl in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:21:05] (03CR) 10Herron: [C: 03+1] thanos: add pint for thanos-rule [puppet] - 10https://gerrit.wikimedia.org/r/898701 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [16:22:25] (03PS2) 10Daniel Kinzler: Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) [16:23:10] (03CR) 10CI reject: [V: 04-1] Always write parsoid output to parser cache. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [16:23:44] (03PS1) 10Jbond: P:mariadb: drop pki2001 from grants [puppet] - 10https://gerrit.wikimedia.org/r/898800 (https://phabricator.wikimedia.org/T332018) [16:23:48] (03PS1) 10Jbond: pki2001: remove last refrences to pki2001 [puppet] - 10https://gerrit.wikimedia.org/r/898801 (https://phabricator.wikimedia.org/T332018) [16:24:23] PROBLEM - Disk space on ms-be2067 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdy1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [16:24:24] (03PS2) 10Jbond: pki2001: remove last refrences to pki2001 [puppet] - 10https://gerrit.wikimedia.org/r/898801 (https://phabricator.wikimedia.org/T332018) [16:24:29] (03CR) 10Jbond: [C: 03+2] pki2001: remove last refrences to pki2001 [puppet] - 10https://gerrit.wikimedia.org/r/898801 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond) [16:26:49] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations, 10decommission-hardware, 10Patch-For-Review: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10jbond) [16:27:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10NHillard-WMF) I approve this access as Kim's manager - thanks! [16:28:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/898793 (https://phabricator.wikimedia.org/T309010) (owner: 10Herron) [16:28:52] (03PS1) 10Btullis: Fix an error with the ceph::server::firewall profile. 
[puppet] - 10https://gerrit.wikimedia.org/r/898803 (https://phabricator.wikimedia.org/T330149) [16:29:10] 10SRE, 10Infrastructure-Foundations: Migrate PKI servers to Bullseye - https://phabricator.wikimedia.org/T331696 (10jbond) 05Open→03Resolved a:03jbond [16:29:47] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) Most of the installer logic has been adapted for Bookworm, but there's one puzzling issue impacting the retrieval of our preseeded partitioning config. In https://gi... [16:30:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10KSarabia-WMF) [16:30:11] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40110/console" [puppet] - 10https://gerrit.wikimedia.org/r/898803 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis) [16:31:31] (03PS3) 10Daniel Kinzler: Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) [16:34:44] (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix an error with the ceph::server::firewall profile. [puppet] - 10https://gerrit.wikimedia.org/r/898803 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis) [16:36:37] PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh1002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (stale by 1726017.11s). 
https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [16:36:39] PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh6002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (stale by 1726522.19s). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [16:37:16] oh ha [16:37:26] this check seemed like a weird idea but it's so handy [16:38:01] PROBLEM - Host ms-be2067 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:07] 10SRE-OnFire, 10Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10akosiaris) Removing #SRE as the more specific working group is tagged already. [16:40:19] RECOVERY - Host ms-be2067 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [16:42:35] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:59] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:21] (03PS5) 10JMeybohm: Remove default-network-policy-conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035) [16:44:33] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:44:35] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:44:38] (03CR) 10Effie Mouzeli: Assign mediawiki roles to mw2420-mw2451 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: 10Clément Goubert) [16:44:47] RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [16:47:06] !log rolling restart of pdns-rec to pick up config changes [16:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:15] !log rolling restart of pdns-rec in A:wikidough to pick up config changes [16:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [16:50:02] (03CR) 10JMeybohm: [C: 03+1] k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [16:51:25] PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh5001 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after 
/etc/powerdns/recursor.conf was changed (stale by 1725395.80s). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [16:51:43] ^ should resolve shortly [16:52:51] (03PS1) 10Arturo Borrero Gonzalez: hieradata: acme_chief: authorize new cloudlb hosts to access cert [puppet] - 10https://gerrit.wikimedia.org/r/898808 (https://phabricator.wikimedia.org/T324992) [16:53:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: acme_chief: authorize new cloudlb hosts to access cert [puppet] - 10https://gerrit.wikimedia.org/r/898808 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [16:54:38] (03CR) 10Jbond: "fly by comment" [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [16:55:27] 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10akosiaris) Remove #wikimedia-incident-actionable since I fail to find an action item that fits this projects description `Action items that came out of t... 
[16:56:27] (03CR) 10JHathaway: [V: 03+1 C: 03+2] kernel-purge: purge all at once and check return code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [16:57:13] RECOVERY - haproxy process on cloudlb2003-dev is OK: PROCS OK: 1 process with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [16:57:25] RECOVERY - haproxy process on cloudlb2002-dev is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [16:57:39] RECOVERY - Check systemd state on cloudlb2003-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:59] RECOVERY - Check systemd state on cloudlb2002-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:19] 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10akosiaris) Removing #SRE, adding the more specific SRE subteam that can probably drive this forward. 
[16:59:06] 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1700) [17:01:29] 10SRE-OnFire, 10Traffic, 10Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (10akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam [17:02:17] RECOVERY - haproxy alive on cloudlb2003-dev is OK: OK check_alive uptime 303s https://wikitech.wikimedia.org/wiki/HAProxy [17:02:50] 10SRE, 10conftool, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10akosiaris) As a note, I am unsure which team to triage this to during sprint week. [17:02:58] (03CR) 10JMeybohm: Refactor and centralize BGPpeer config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [17:03:13] RECOVERY - haproxy alive on cloudlb2002-dev is OK: OK check_alive uptime 413s https://wikitech.wikimedia.org/wiki/HAProxy [17:05:12] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10MatthewVernon) 05Resolved→03Open Hi @Papaul Sorry, but there's still a missing disk in this system - `Virtual Drive 21` is absent, which I think is `Slot Number: 19` Could you... 
[17:05:43] 10SRE-OnFire, 10SRE Observability, 10Sustainability (Incident Followup): create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10akosiaris) Apparently #sre_observability team has it in their board, I am tagging with that and removing #SRE [17:06:07] (03PS7) 10Hnowlan: service, k8s: add service configuration for AQS2 service device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) [17:06:35] 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10MatthewVernon) [for reference, I was on leave on 2022-08-24] [17:07:31] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40111/console" [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [17:07:33] 10SRE, 10Patch-For-Review, 10Sustainability (Incident Followup): Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10akosiaris) @dzahn anything left to do here? Would it be a good sprint week thing? [17:07:41] 10SRE-OnFire, 10Observability-Logging, 10Sustainability (Incident Followup): create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10colewhite) [17:08:00] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:09:51] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10MatthewVernon) ...this may be a different disk that's failed, so maybe it just got unseated? 
[17:10:47] (03Abandoned) 10Anzx: Remove FlaggedRevs for ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762) (owner: 10Anzx) [17:11:03] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2003-dev.codfw.wmnet with OS bullseye [17:13:07] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10Sustainability (Incident Followup): Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10akosiaris) [17:14:29] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10Sustainability (Incident Followup): Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10akosiaris) 05Open→03Resolved @cdanis, No other actionables showed up, alerti... [17:14:39] (03PS1) 10Hnowlan: cassandra: add stub secret for device_analytics [labs/private] - 10https://gerrit.wikimedia.org/r/898810 (https://phabricator.wikimedia.org/T320967) [17:15:22] (03PS3) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755 [17:16:24] (03CR) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:16:41] (03PS1) 10Andrew Bogott: Trove: increase timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/898811 [17:16:59] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] cassandra: add stub secret for device_analytics [labs/private] - 10https://gerrit.wikimedia.org/r/898810 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [17:17:05] 10SRE, 10SRE-OnFire, 10wikitech.wikimedia.org, 10User-LSobanski: Incident response tools operational readiness review - https://phabricator.wikimedia.org/T290130 (10akosiaris) Removing #wikimedia-incident-actionable as it 
isn't clear by mention of task, incident doc, status doc, task or something similar h... [17:17:28] (03CR) 10Andrew Bogott: [C: 03+2] Trove: increase timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/898811 (owner: 10Andrew Bogott) [17:18:37] (03PS3) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) [17:19:13] (03CR) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [17:20:19] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40113/console" [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [17:20:40] 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10serviceops, 10Sustainability (Incident Followup): Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam (2 o... [17:21:04] (03PS4) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) [17:21:34] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10akosiaris) Removing #SRE, has already been triaged to a more... 
[17:22:23] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40114/console" [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [17:23:54] 10SRE, 10PyBal, 10Traffic-Icebox, 10Sustainability (Incident Followup): Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (10akosiaris) Removing #SRE, traging to #traffic-icebox since this is pybal and is pretty old. [17:24:14] (03CR) 10Elukey: [C: 03+2] ml-services: tune autoscaling for ORES model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [17:25:27] (03PS1) 10CDanis: move jameel to sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/898814 [17:26:05] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10Structured Data Engineering, and 5 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam [17:26:25] (03CR) 10CDanis: [C: 03+2] move jameel to sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/898814 (owner: 10CDanis) [17:29:38] 10ops-eqiad, 10SRE Observability (FY2022/2023-Q3): Decomission centrallog1001 - https://phabricator.wikimedia.org/T328803 (10andrea.denisse) a:05andrea.denisse→03None [17:29:58] (03PS1) 10JHathaway: netboot: fix a few syntax errors [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495) [17:30:02] (03PS4) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755 [17:30:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos) [17:31:32] (03CR) 10JHathaway: [V: 
03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40117/console" [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [17:32:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos) [17:32:19] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10Sustainability (Incident Followup), 10User-Joe: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551 (10akosiaris) 05Open→03Resolved a:03akosiaris No u... [17:32:25] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537 (10akosiaris) [17:32:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos) [17:32:48] (03CR) 10David Caro: Improvements to maintain-dbusers and the rest-api (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:34:05] 10SRE, 10observability, 10Epic, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942 (10akosiaris) 05Open→03Resolved a:03akosiaris No comments in 6 years, I am gonna tentatively r... 
[17:35:31] (03PS5) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755 [17:36:32] (03PS2) 10JHathaway: netboot: fix a few syntax errors [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495) [17:37:00] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40118/console" [puppet] - 10https://gerrit.wikimedia.org/r/898755 (owner: 10David Caro) [17:37:04] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [17:37:16] (03CR) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:37:35] 10SRE, 10Release-Engineering-Team (Seen), 10Sustainability (Incident Followup): Review new service 'pre-deployment to production' checklist - https://phabricator.wikimedia.org/T141897 (10akosiaris) 05Open→03Resolved a:03akosiaris No comments, updates or anything for that matter since the opening of thi... 
[17:37:45] (03CR) 10JHathaway: [C: 03+2] netboot: fix a few syntax errors [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [17:37:49] (03PS7) 10Hnowlan: helmfile: add device-analytics configuration, namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) [17:37:53] (03CR) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:39:52] 10SRE-swift-storage, 10Commons: Commons files missing - https://phabricator.wikimedia.org/T332019 (10Aklapper) [17:40:12] 10SRE-swift-storage, 10Commons: 404 error for image thumbnail file on Commons - https://phabricator.wikimedia.org/T332019 (10Aklapper) [17:42:57] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136 (10akosiaris) Removing SRE, adding #data-persistence [17:45:20] 10SRE, 10Sustainability (Incident Followup): Detect high server load earlier – prometheus alert? - https://phabricator.wikimedia.org/T188317 (10akosiaris) 05Open→03Resolved a:03akosiaris High server load isn't a good metric of anything, just a symptom/indication that something is wrong. We now have quite... 
[17:45:52] (03CR) 10Hnowlan: [C: 03+2] helmfile: add device-analytics configuration, namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [17:49:40] 10SRE-swift-storage, 10Commons: 404 error for image thumbnail file on Commons - https://phabricator.wikimedia.org/T332019 (10Aklapper) [17:49:55] 10SRE-swift-storage, 10Commons: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper) [17:52:19] 10SRE-swift-storage, 10Commons: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper) [17:52:22] !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:52:23] !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:52:54] !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:52:55] 10SRE, 10serviceops, 10Sustainability (Incident Followup): docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10akosiaris) [17:53:04] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Volans) As an update, with the current situation even if a ho... [17:53:04] !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:53:09] 10SRE, 10serviceops, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): create swift container-to-container synchronization metrics - https://phabricator.wikimedia.org/T229117 (10akosiaris) 05Open→03Declined I am gonna close this as Declined. Many years later I don't remember any... 
[17:53:18] 10SRE, 10serviceops, 10Sustainability (Incident Followup): docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10akosiaris) [17:53:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): create a docker_registry_codfw swift container backup - https://phabricator.wikimedia.org/T229118 (10akosiaris) 05Open→03Declined I am gonna close this as Declined. Many years later I don't remember any more... [17:53:44] (03Merged) 10jenkins-bot: helmfile: add device-analytics configuration, namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [17:55:28] !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:55:42] 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136 (10Kappakayala) [17:55:59] !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:56:20] !log hnowlan@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:57:35] (03CR) 10David Caro: Improvements to maintain-dbusers and the rest-api (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:58:07] (03PS4) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) [17:58:37] !log hnowlan@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:59:01] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [17:59:25] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[17:59:59] (03CR) 10CI reject: [V: 04-1] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:00:01] (03PS5) 10David Caro: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:00:04] brennen and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1800). [18:01:55] (03CR) 10CI reject: [V: 04-1] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:01:58] o/ [18:03:44] !log 1.40.0-wmf.27 train (T330205): no current blockers, rolling to group0. [18:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:49] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [18:05:05] (03PS6) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) [18:05:47] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898818 (https://phabricator.wikimedia.org/T330205) [18:05:51] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898818 (https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot) [18:06:03] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [18:06:08] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [18:06:46] (03PS7) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 
10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) [18:06:49] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898818 (https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot) [18:10:57] (03PS1) 10Btullis: Fix the srange format for the ceph servers srange [puppet] - 10https://gerrit.wikimedia.org/r/898819 (https://phabricator.wikimedia.org/T330149) [18:12:44] (03CR) 10Btullis: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40119/console" [puppet] - 10https://gerrit.wikimedia.org/r/898803 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis) [18:13:08] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging2002.codfw.wmnet with OS bullseye [18:13:34] (03PS1) 10Hnowlan: device-analytics: add missing mesh port [deployment-charts] - 10https://gerrit.wikimedia.org/r/898820 (https://phabricator.wikimedia.org/T320967) [18:13:36] (03CR) 10David Caro: [C: 03+2] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:13:48] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.27 refs T330205 [18:13:53] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [18:13:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:15:28] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [18:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:22:25] !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [18:22:33] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10bd808) >>! In T320794#8693751, @SLyngshede-WMF wrote: > We actually have uid, cn and sn which are all by default set to the developer account username. Possibly pedantic,... [18:22:55] !log fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 30s) [18:23:33] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn) [18:24:24] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn) [18:25:34] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [18:25:42] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [18:25:59] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [18:27:09] (03CR) 10Btullis: [C: 03+2] Fix the srange format for the ceph servers srange [puppet] - 10https://gerrit.wikimedia.org/r/898819 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis) [18:27:33] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2002.codfw.wmnet with reason: host reimage [18:27:54] !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [18:28:04] (03PS1) 10Hnowlan: cassandra: fix device_analytics creation syntax [puppet] - 10https://gerrit.wikimedia.org/r/898824 
(https://phabricator.wikimedia.org/T320967) [18:28:06] !log fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 11s) [18:30:17] (03CR) 10Eevans: [C: 03+1] cassandra: fix device_analytics creation syntax [puppet] - 10https://gerrit.wikimedia.org/r/898824 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [18:32:29] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2002.codfw.wmnet with reason: host reimage [18:42:42] (03CR) 10Effie Mouzeli: [C: 03+1] "Worth having a go" [deployment-charts] - 10https://gerrit.wikimedia.org/r/898728 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [18:43:57] (03PS1) 10David Caro: replica_cnf: use correct paws userhomes path [puppet] - 10https://gerrit.wikimedia.org/r/898831 [18:44:49] (03CR) 10Andrew Bogott: [C: 03+2] replica_cnf: use correct paws userhomes path [puppet] - 10https://gerrit.wikimedia.org/r/898831 (owner: 10David Caro) [18:51:39] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2002.codfw.wmnet with OS bullseye [18:57:39] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [19:01:35] 10SRE, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [19:02:22] (03PS1) 10Krinkle: rdbms: Add db_log_category=performance to TransactionProfiler [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/898725 [19:02:25] 10SRE, 10Traffic, 10Patch-For-Review: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10ssingh) 05Open→03Resolved a:03ssingh Closing this in favour of T321309 where it is being tracked and also given that the Ganeti reimaging cookbook exists which was the prim... 
[19:02:33] SRE, Traffic: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[19:12:00] hello. anyone around who could check on a maintenance script run for me? i hope it has finished. https://phabricator.wikimedia.org/T315510#8656935
[19:13:05] I'll have a look
[19:13:26] MatmaRex: says 'Processed 2315500 (updated 1154395) of 7380439 rows'
[19:14:10] ugh okay. thanks
[19:14:19] isn't that going slower than before the switchover?
[19:14:35] maybe?
[19:14:41] it feels very slow either way
[19:14:51] hm, when are we switching back? will it finish in time? :P
[19:16:07] hm. given I had to restart it once, maybe the 'processed X of Y' numbers are not accurate? since X would be number of row processed on this run but Y number of total rows
[19:16:25] the total number is wrong
[19:16:55] but ladsgroup made me do it :D
[19:17:22] i'm not sure how many rows there are to really process. maybe 3M or so
[19:18:26] if we do end up restarting it, we should do it with the --start parameter as printed by the script
[19:20:09] oh, i guess you did it? i'm not sure what you meant in your message
[19:21:42] SRE-swift-storage, Commons: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (Yann) Other people got the same error: https://commons.wikimedia.org/wiki/Commons:Village_pump#Files_are_not_appearing.
[19:21:58] yes, I did it with --start. I just noted that even with --start the number of processed rows printed is processed since starting the script that time, not total, so the speed is not comparable with runs without a restart in the middle
[19:23:49] mhm
[19:24:06] thanks for doing that!
[19:25:06] i wonder if it would be okay to run these in parallel on several wikis
[19:30:40] !log 1.40.0-wmf.27 train (T330205): uneventful at group0. i'm afk for about an hour.
[19:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:45] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [19:32:51] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [19:32:56] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [19:40:48] (03CR) 10David Caro: [C: 03+2] Improvements to maintain-dbusers and the rest-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:41:05] (03CR) 10David Caro: [C: 03+2] Improvements to maintain-dbusers and the rest-api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:44:30] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt with reason: cloudsw1-b1-codfw OS upgrade [19:44:46] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt with reason: cloudsw1-b1-codfw OS upgrade [19:44:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=484494a0-6cb6-44... 
[19:47:17] !log Reboot cloudsw1-b1-codfw to upgrade JunOS version T327919 [19:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:22] T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 [19:51:00] PROBLEM - Host cloudlb2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [19:53:26] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:54:58] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 139, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:59:14] (03PS1) 10Bartosz Dziewoński: Enable new Vector (2022) "Add topic" button at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898843 (https://phabricator.wikimedia.org/T331313) [19:59:18] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools usability improvements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898844 (https://phabricator.wikimedia.org/T329407) [19:59:20] (03PS1) 10Bartosz Dziewoński: Enable new Vector (2022) "Add topic" button at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313) [19:59:22] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools usability improvements at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407) [20:00:00] (03CR) 10CI reject: [V: 04-1] Enable new Vector (2022) "Add topic" button at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898843 (https://phabricator.wikimedia.org/T331313) (owner: 10Bartosz Dziewoński) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor 
Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:05] (03CR) 10CI reject: [V: 04-1] Enable DiscussionTools usability improvements at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407) (owner: 10Bartosz Dziewoński) [20:00:09] (03CR) 10CI reject: [V: 04-1] Enable new Vector (2022) "Add topic" button at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313) (owner: 10Bartosz Dziewoński) [20:00:11] (03CR) 10CI reject: [V: 04-1] Enable DiscussionTools usability improvements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898844 (https://phabricator.wikimedia.org/T329407) (owner: 10Bartosz Dziewoński) [20:01:43] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper) [20:02:20] RECOVERY - Host cloudlb2001-dev is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [20:02:27] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper) [20:02:36] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:02:40] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper) p:05Triage→03Unbreak! 
[20:02:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:03:42] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [20:03:47] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)... [20:04:02] (03PS8) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) [20:04:14] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [20:04:19] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [20:05:12] (03CR) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:06:48] (03CR) 10David Caro: [C: 03+2] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:09:39] (03PS1) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) [20:10:10] (03CR) 10David Caro: [V: 03+1 C: 03+2] wmcs: move maintaindbusers to cloudcontrols [puppet] 
- 10https://gerrit.wikimedia.org/r/898755 (owner: 10David Caro) [20:10:15] (03CR) 10CI reject: [V: 04-1] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney) [20:20:18] (03PS2) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) [20:20:52] (03CR) 10CI reject: [V: 04-1] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney) [20:24:37] (03PS2) 10Zabe: dewiki: Allow 'crats to remove sysopship and manage importers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897997 (https://phabricator.wikimedia.org/T331921) [20:24:52] jouncebot: nowandnext [20:24:52] For the next 0 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T2000) [20:24:52] In 9 hour(s) and 35 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T0600) [20:24:59] (03CR) 10Zabe: [C: 03+2] dewiki: Allow 'crats to remove sysopship and manage importers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897997 (https://phabricator.wikimedia.org/T331921) (owner: 10Zabe) [20:25:12] (03PS1) 10David Caro: replica_cnf: skip toolforge users without a home [puppet] - 10https://gerrit.wikimedia.org/r/898852 [20:26:11] (03Merged) 10jenkins-bot: dewiki: Allow 'crats to remove sysopship and manage importers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897997 (https://phabricator.wikimedia.org/T331921) (owner: 10Zabe) [20:27:03] !log zabe@deploy2002 Started scap: Backport for [[gerrit:897997|dewiki: Allow 'crats to remove sysopship and manage importers (T331921)]] [20:27:09] T331921: enable de-wp bureaucrats to 
remove adminflag and to grant importer rights - https://phabricator.wikimedia.org/T331921 [20:28:20] 10SRE, 10Traffic: Cleanup and refactor the dnsrecursor module - https://phabricator.wikimedia.org/T332083 (10ssingh) [20:28:41] !log zabe@deploy2002 zabe: Backport for [[gerrit:897997|dewiki: Allow 'crats to remove sysopship and manage importers (T331921)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:28:47] 10SRE, 10Traffic: Clean up and refactor the dnsrecursor module - https://phabricator.wikimedia.org/T332083 (10ssingh) [20:29:13] 10SRE, 10Traffic: Clean up and refactor the dnsrecursor module - https://phabricator.wikimedia.org/T332083 (10ssingh) p:05Triage→03Medium [20:33:43] (03PS2) 10Bartosz Dziewoński: Enable new Vector (2022) "Add topic" button at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898843 (https://phabricator.wikimedia.org/T331313) [20:33:45] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools usability improvements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898844 (https://phabricator.wikimedia.org/T329407) [20:33:47] (03PS2) 10Bartosz Dziewoński: Enable new Vector (2022) "Add topic" button at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313) [20:33:49] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools usability improvements at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407) [20:35:39] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:897997|dewiki: Allow 'crats to remove sysopship and manage importers (T331921)]] (duration: 08m 36s) [20:35:45] T331921: enable de-wp bureaucrats to remove adminflag and to grant importer rights - https://phabricator.wikimedia.org/T331921 [20:36:46] 10SRE-swift-storage, 10Commons: 404 error for image thumbnail file on Commons - 
https://phabricator.wikimedia.org/T332019 (10Yann) Duplicate to T331820 [20:39:29] (03PS3) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) [20:40:05] (03CR) 10CI reject: [V: 04-1] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney) [20:41:32] PROBLEM - Ensure mysql credential creation for tools users is running on cloudcontrol1006 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:41:53] (03PS4) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) [20:47:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:47:41] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [20:47:45] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)... 
[20:47:58] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [20:48:04] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [20:52:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:01] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 227.9k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [21:02:04] (03PS2) 10Raymond Ndibe: replica_cnf: skip toolforge users without a home [puppet] - 10https://gerrit.wikimedia.org/r/898852 (owner: 10David Caro) [21:02:48] PROBLEM - Ensure mysql credential creation for tools users is running on cloudcontrol1007 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:09:15] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Don-vip) Another example: https://commons.wikimedia.org/wiki/File:The_Sky_is_Not_the_Limit;_There_Are_Footprints_on_the_Moon_(5932386).jpeg https://upload.wikimedia.org/wikipedia/commons/th... 
[21:11:04] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [21:11:07] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)... [21:11:22] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [21:11:28] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [21:11:34] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [21:11:37] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)... 
[21:13:32] (PS1) Umherirrender: action: Restrict action.delete.js to action=delete pages [core] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/898867 (https://phabricator.wikimedia.org/T330205) [21:16:57] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [21:17:08] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [21:19:01] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [21:20:58] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [21:38:00] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [21:38:20] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [22:08:35] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [22:25:27] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [22:34:01] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [22:34:34] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [22:37:28] RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh1002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed (within 3600
seconds). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [22:37:28] RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh6002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [22:50:26] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [22:52:16] RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh5001 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [22:58:48] jouncebot nowandnext [22:58:48] No deployments scheduled for the next 7 hour(s) and 1 minute(s) [22:58:48] In 7 hour(s) and 1 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T0600) [22:59:37] slinging out https://gerrit.wikimedia.org/r/c/mediawiki/core/+/898867/ [23:19:22] !log brennen@deploy2002 Started scap: Backport for [[gerrit:898867|action: Restrict action.delete.js to action=delete pages (T330205)]] [23:19:28] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [23:20:55] !log brennen@deploy2002 brennen and umherirrender: Backport for [[gerrit:898867|action: Restrict action.delete.js to action=delete pages (T330205)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [23:29:55] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:898867|action: Restrict action.delete.js to action=delete pages (T330205)]] (duration: 10m 32s) [23:30:00] T330205: 1.40.0-wmf.27 deployment blockers - 
https://phabricator.wikimedia.org/T330205