[00:03:55] <icinga-wm>	 PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[00:13:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P45824 and previous config saved to /var/cache/conftool/dbconfig/20230314-001313-marostegui.json
[00:17:26] <wikibugs>	 (CR) Andrew Bogott: [C: +2] rbd2backy2.py: handle empty 'expire' stamps [puppet] - https://gerrit.wikimedia.org/r/897955 (owner: Andrew Bogott)
[00:21:25] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[00:28:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T329260)', diff saved to https://phabricator.wikimedia.org/P45825 and previous config saved to /var/cache/conftool/dbconfig/20230314-002819-marostegui.json
[00:28:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance
[00:28:25] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[00:28:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance
[00:28:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T329260)', diff saved to https://phabricator.wikimedia.org/P45826 and previous config saved to /var/cache/conftool/dbconfig/20230314-002840-marostegui.json
[00:39:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T329260)', diff saved to https://phabricator.wikimedia.org/P45827 and previous config saved to /var/cache/conftool/dbconfig/20230314-003903-marostegui.json
[00:39:09] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[00:48:23] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:48:25] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:48:33] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:54:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P45828 and previous config saved to /var/cache/conftool/dbconfig/20230314-005409-marostegui.json
[00:57:25] <icinga-wm>	 RECOVERY - Check systemd state on aphlict2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:07:34] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (phaultfinder)
[01:09:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P45829 and previous config saved to /var/cache/conftool/dbconfig/20230314-010915-marostegui.json
[01:14:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:19:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:24:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T329260)', diff saved to https://phabricator.wikimedia.org/P45830 and previous config saved to /var/cache/conftool/dbconfig/20230314-012421-marostegui.json
[01:24:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[01:24:28] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[01:24:37] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[01:24:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T329260)', diff saved to https://phabricator.wikimedia.org/P45831 and previous config saved to /var/cache/conftool/dbconfig/20230314-012442-marostegui.json
[01:29:49] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (Papaul) @wiki_willy thank you for the heads up. @MatthewVernon i checked the systemc
[01:35:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T329260)', diff saved to https://phabricator.wikimedia.org/P45832 and previous config saved to /var/cache/conftool/dbconfig/20230314-013504-marostegui.json
[01:35:12] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[01:50:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P45833 and previous config saved to /var/cache/conftool/dbconfig/20230314-015011-marostegui.json
[01:59:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0200)
[02:04:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:05:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P45834 and previous config saved to /var/cache/conftool/dbconfig/20230314-020517-marostegui.json
[02:07:53] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.27 [core] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/897929 (https://phabricator.wikimedia.org/T330205)
[02:07:59] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.27 [core] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/897929 (https://phabricator.wikimedia.org/T330205) (owner: TrainBranchBot)
[02:09:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T329260)', diff saved to https://phabricator.wikimedia.org/P45835 and previous config saved to /var/cache/conftool/dbconfig/20230314-022023-marostegui.json
[02:20:30] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[02:22:21] <legoktm>	 !log removed user's 2FA on wikitech for T331955
[02:22:22] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (Papaul) Your dispatch shipped on 3/13/2023 4:41 PM
[02:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:16] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.27 [core] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/897929 (https://phabricator.wikimedia.org/T330205) (owner: TrainBranchBot)
[02:24:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:31:43] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Samwilson)
[02:34:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:44:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:49:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0300)
[03:00:23] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:14] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/898245 (https://phabricator.wikimedia.org/T330205)
[03:01:16] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/898245 (https://phabricator.wikimedia.org/T330205) (owner: TrainBranchBot)
[03:02:02] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/898245 (https://phabricator.wikimedia.org/T330205) (owner: TrainBranchBot)
[03:02:27] <logmsgbot>	 !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.40.0-wmf.27  refs T330205
[03:02:34] <stashbot>	 T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205
[03:07:07] <wikibugs>	 (PS1) Gergő Tisza: [beta] GrowthExperiments: Short leveling up notification delay for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/898246 (https://phabricator.wikimedia.org/T330358)
[03:18:13] <icinga-wm>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:29:05] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:53:30] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.40.0-wmf.27  refs T330205 (duration: 51m 02s)
[03:53:35] <stashbot>	 T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205
[03:55:52] <logmsgbot>	 !log mwpresync@deploy2002 Pruned MediaWiki: 1.40.0-wmf.25 (duration: 02m 20s)
[04:56:39] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.122`. Pre-deploy tests passing on canary `wdqs1003`
[04:56:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:53] <logmsgbot>	 !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@61ef435]: 0.3.122
[04:57:23] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.122` on canary `wdqs1003`; proceeding to rest of fleet
[04:57:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:39] <logmsgbot>	 !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@61ef435]: 0.3.122 (duration: 08m 45s)
[05:07:07] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[05:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:11] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[05:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:22] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[05:07:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0600).
[06:04:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance
[06:04:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance
[06:16:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance
[06:16:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance
[06:16:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T329260)', diff saved to https://phabricator.wikimedia.org/P45836 and previous config saved to /var/cache/conftool/dbconfig/20230314-061633-marostegui.json
[06:16:39] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[06:19:41] <wikibugs>	 (PS8) KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[06:24:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:28:37] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts centrallog1001
[06:32:00] <wikibugs>	 (PS1) Marostegui: production-m2.sql.erb: New user for excimer database [puppet] - https://gerrit.wikimedia.org/r/898447 (https://phabricator.wikimedia.org/T331956)
[06:33:09] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:34:10] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[06:34:29] <wikibugs>	 (CR) Marostegui: "This requires manual creation on the database. This patch is just for tracking purposes" [puppet] - https://gerrit.wikimedia.org/r/898447 (https://phabricator.wikimedia.org/T331956) (owner: Marostegui)
[06:34:43] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:34:53] <wikibugs>	 (CR) Andrea Denisse: [C: +1] logstash: move mediawiki ecs logs into mediawiki partition [puppet] - https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[06:35:27] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [software/librenms] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) (owner: Filippo Giunchedi)
[06:36:11] <wikibugs>	 (PS2) Marostegui: production-m2.sql.erb: New user for excimer database [puppet] - https://gerrit.wikimedia.org/r/898447 (https://phabricator.wikimedia.org/T331956)
[06:36:18] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: centrallog1001 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[06:41:46] <hashar>	 !log gerrit: changed `operations/puppet` merge strategy to allow "content merges" (see `ops` list for the rationale)
[06:41:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:46] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: centrallog1001 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[06:42:46] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:42:47] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts centrallog1001
[06:43:05] <wikibugs>	 (CR) Marostegui: [C: +2] production-m2.sql.erb: New user for excimer database [puppet] - https://gerrit.wikimedia.org/r/898447 (https://phabricator.wikimedia.org/T331956) (owner: Marostegui)
[06:46:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T329260)', diff saved to https://phabricator.wikimedia.org/P45837 and previous config saved to /var/cache/conftool/dbconfig/20230314-064630-marostegui.json
[06:46:36] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:01:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45838 and previous config saved to /var/cache/conftool/dbconfig/20230314-070137-marostegui.json
[07:10:30] <wikibugs>	 (PS1) Marostegui: mariadb: Migrate db2135 to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/898456 (https://phabricator.wikimedia.org/T322294)
[07:11:20] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Migrate db2135 to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/898456 (https://phabricator.wikimedia.org/T322294) (owner: Marostegui)
[07:13:53] <marostegui>	 !log Migrate db2135 to mariadb m5 codfw dbmaint 10.6
[07:13:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45839 and previous config saved to /var/cache/conftool/dbconfig/20230314-071643-marostegui.json
[07:25:59] <marostegui>	 !log Migrate db1183 to mariadb m5 eqiad dbmaint 10.6 T322294
[07:26:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:04] <stashbot>	 T322294: Migrate m5 section to MariaDB 10.6 - https://phabricator.wikimedia.org/T322294
[07:26:38] <wikibugs>	 (PS1) Marostegui: mariadb: Migrate db1183 to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/898672 (https://phabricator.wikimedia.org/T322294)
[07:27:23] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Migrate db1183 to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/898672 (https://phabricator.wikimedia.org/T322294) (owner: Marostegui)
[07:31:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T329260)', diff saved to https://phabricator.wikimedia.org/P45840 and previous config saved to /var/cache/conftool/dbconfig/20230314-073149-marostegui.json
[07:31:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[07:31:55] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[07:32:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[07:32:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T329260)', diff saved to https://phabricator.wikimedia.org/P45841 and previous config saved to /var/cache/conftool/dbconfig/20230314-073210-marostegui.json
[07:57:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T329260)', diff saved to https://phabricator.wikimedia.org/P45842 and previous config saved to /var/cache/conftool/dbconfig/20230314-075730-marostegui.json
[07:57:37] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[08:00:33] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Configure database size for MDB backend [puppet] - https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) (owner: Muehlenhoff)
[08:04:50] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (MoritzMuehlenhoff) @karapayneWMDE : This needs your sign off on the WMDE side. @thcipriani : This needs your approval for the deployment access
[08:05:08] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (MoritzMuehlenhoff)
[08:08:23] <wikibugs>	 (PS1) Muehlenhoff: Add itamar to deployment group [puppet] - https://gerrit.wikimedia.org/r/898675 (https://phabricator.wikimedia.org/T331899)
[08:12:09] <wikibugs>	 SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (MoritzMuehlenhoff) >>! In T331647#8688948, @xcollazo wrote: > Hal needs to deploy to the `platform-eng` Airflow instance. So he needs `platform-eng-deployers`?  That's the correct group, yes (although per t...
[08:12:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45843 and previous config saved to /var/cache/conftool/dbconfig/20230314-081236-marostegui.json
[08:20:42] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Restrict prefix length for public announce, allow bgp for cloud range (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/896200 (https://phabricator.wikimedia.org/T327919) (owner: Cathal Mooney)
[08:27:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45845 and previous config saved to /var/cache/conftool/dbconfig/20230314-082743-marostegui.json
[08:31:45] <icinga-wm>	 RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:50] <vgutierrez>	 !log fetch haproxy 2.6.10 for thirdparty/haproxy26 (buster && bullseye) @ apt.wm.o
[08:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:04] <wikibugs>	 (PS1) Cathal Mooney: Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919)
[08:32:51] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[08:34:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:34:53] <wikibugs>	 (CR) Ayounsi: [C: +1] "1 small comment then lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: Cathal Mooney)
[08:36:17] <wikibugs>	 (CR) Ayounsi: [C: -1] "Almost forgot, it needs to go with a change in config/sites.yaml as well." [homer/public] - https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: Cathal Mooney)
[08:38:48] <vgutierrez>	 !log test HAProxy 2.6.10 in cp4044 and cp4045
[08:38:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T329260)', diff saved to https://phabricator.wikimedia.org/P45846 and previous config saved to /var/cache/conftool/dbconfig/20230314-084249-marostegui.json
[08:42:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[08:42:56] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[08:42:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:43:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[08:44:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:47:46] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (ayounsi) Indeed! looks like Cisco specific :(  I sent an email to our account rep just in case: > Additionally I was wondering if Junos supported in any way forwardingStatus in IP...
[08:47:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:49:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:49:58] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Vgutierrez)
[08:50:13] <wikibugs>	 (PS1) Filippo Giunchedi: karma: change 'source' label color [puppet] - https://gerrit.wikimedia.org/r/898682
[08:52:37] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "Trivial change thus self-merge" [puppet] - https://gerrit.wikimedia.org/r/898682 (owner: Filippo Giunchedi)
[08:53:00] <wikibugs>	 (PS3) Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[09:01:14] <wikibugs>	 (CR) JMeybohm: [C: +2] cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - https://gerrit.wikimedia.org/r/897950 (https://phabricator.wikimedia.org/T325292) (owner: JMeybohm)
[09:02:26] <wikibugs>	 (PS4) Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[09:03:37] <wikibugs>	 (PS5) Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[09:04:47] <wikibugs>	 (CR) Stevemunene: [C: +1] Configure the new ceph servers with mon and mgr daemons [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: Btullis)
[09:04:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:06:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[09:06:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[09:06:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T329260)', diff saved to https://phabricator.wikimedia.org/P45847 and previous config saved to /var/cache/conftool/dbconfig/20230314-090649-marostegui.json
[09:06:50] <wikibugs>	 (03Merged) 10jenkins-bot: cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/897950 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm)
[09:06:55] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[09:23:19] <Emperor>	 !log reboot ms-be2040 T331860
[09:23:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:25] <stashbot>	 T331860: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860
[09:24:11] <icinga-wm>	 PROBLEM - Host ms-be2040 is DOWN: PING CRITICAL - Packet loss = 100%
[09:24:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:28:12] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10elukey)
[09:29:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] node_regex: add a fixer [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897925 (owner: 10Jbond)
[09:29:20] <wikibugs>	 (03PS1) 10Cathal Mooney: Move cloudsw prefix-list filters from templates to YAML [homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919)
[09:29:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:30:26] <wikibugs>	 (03PS1) 10Jbond: 1.1.2: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/898685
[09:31:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] 1.1.2: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/898685 (owner: 10Jbond)
[09:31:55] <wikibugs>	 (03PS2) 10Cathal Mooney: Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919)
[09:32:08] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:32:11] <wikibugs>	 (03CR) 10Cathal Mooney: Modify policy to use in aggregate for 185.15.57.0/24 in codfw (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[09:33:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T329260)', diff saved to https://phabricator.wikimedia.org/P45848 and previous config saved to /var/cache/conftool/dbconfig/20230314-093321-marostegui.json
[09:33:27] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[09:34:14] <claime>	 Reminder we'll be repooling eqiad RO (so all active/active services, including mw ones) at 11:30UTC
[09:35:56] <wikibugs>	 (03PS6) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[09:36:12] <moritzm>	 !log installing NSS security updates
[09:36:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:17] <wikibugs>	 (03CR) 10Elukey: calico/kubernetes: Replace istio_cni_token with client cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[09:37:43] <wikibugs>	 (03PS1) 10JMeybohm: cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/898686 (https://phabricator.wikimedia.org/T325292)
[09:38:26] <wikibugs>	 (03PS1) 10DCausse: wdqs: export more jmx metrics to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/898687 (https://phabricator.wikimedia.org/T331405)
[09:39:22] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] calico/kubernetes: Replace istio_cni_token with client cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[09:39:44] <wikibugs>	 (03PS3) 10Samtar: docroot: Update privacy policy footer link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896216 (https://phabricator.wikimedia.org/T331680) (owner: 10Varnent)
[09:39:51] <wikibugs>	 (03PS3) 10Samtar: [foundationwiki] Grant translation admin rights to 'editor' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896224 (https://phabricator.wikimedia.org/T297396) (owner: 10Varnent)
[09:40:18] <wikibugs>	 (03PS3) 10Samtar: Enable VE on more namespaces on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders)
[09:40:28] <wikibugs>	 (03CR) 10Elukey: calico/kubernetes: Replace istio_cni_token with client cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[09:40:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] calico/kubernetes: Replace istio_cni_token with client cert [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[09:40:36] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, wiki has clear consensus and diffConfig looks correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897997 (https://phabricator.wikimedia.org/T331921) (owner: 10Zabe)
[09:40:46] <TheresNoTime>	 jouncebot: nowandnext
[09:40:47] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 19 minute(s)
[09:40:47] <jouncebot>	 In 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1000)
[09:42:44] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:43:19] <claime>	 TheresNoTime: Just in case you didn't see it, I'll lock scap deployments at 10:00UTC in anticipation of eqiad RO repool at 10:30UTC
[09:43:29] <claime>	 Corrected reminder we'll be repooling eqiad RO (so all active/active services, including mw ones) at 10:30UTC
[09:43:39] <TheresNoTime>	 ah I saw 11:30 :D
[09:43:52] <claime>	 Yeah I messed up my timezones
[09:44:00] <claime>	 ><
[09:44:02] * TheresNoTime will not deploy
[09:44:22] <claime>	 It should be quick-ish, so you can probably deploy right after
[09:44:38] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/898686 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm)
[09:48:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45849 and previous config saved to /var/cache/conftool/dbconfig/20230314-094828-marostegui.json
[09:49:53] <wikibugs>	 (03Merged) 10jenkins-bot: cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/898686 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm)
[09:51:02] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:51:17] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10cmooney) >>! In T331886#8688615, @ayounsi wrote: >> One of my concerns is our other caching sites use matched routers for redundancy and we coul...
[09:51:34] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:53:43] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Ottomata) Yes, and that's okay.    The group Hal should be in then is [[ https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L1009 | analytics-platform-eng-admins ]]...
[09:53:43] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:54:40] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:56:33] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[09:56:42] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) 05Open→03In progress
[09:56:43] <jayme>	 !log disabling puppet on P:calico::kubernetes for T325268
[09:56:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:48] <stashbot>	 T325268: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268
[09:57:03] <icinga-wm>	 RECOVERY - Host ms-be2040 is UP: PING OK - Packet loss = 0%, RTA = 33.00 ms
[09:58:01] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - 10https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[09:58:15] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] calico/kubernetes: Replace istio_cni_token with client cert [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1000)
[10:00:04] <jouncebot>	 claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:00:15] <claime>	 !log Locking scap deployment for service switchover - T330651
[10:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:20] <stashbot>	 T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651
[10:00:43] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10elukey)
[10:00:54] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[10:02:35] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[10:02:37] <claime>	 !log Locking scap deployment for service switchover - T331541
[10:02:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:42] <stashbot>	 T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541
[10:03:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45850 and previous config saved to /var/cache/conftool/dbconfig/20230314-100334-marostegui.json
[10:04:59] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10MatthewVernon) @Papaul that's only 13 disks, not 14? The recent activity panel in the iDRAC shows: ` 2023-03-12T15:08:42-0500 Virtual Disk 8 on Integrated RAID Controller...
[10:14:18] <wikibugs>	 (03PS1) 10Jbond: pki: move services to pki2002 [dns] - 10https://gerrit.wikimedia.org/r/898693
[10:15:45] <jayme>	 !log enabling puppet on P:calico::kubernetes for T325268
[10:15:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:51] <stashbot>	 T325268: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268
[10:17:39] <wikibugs>	 10SRE, 10Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10akosiaris)
[10:17:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris)
[10:17:47] <wikibugs>	 10SRE: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10akosiaris)
[10:17:51] <wikibugs>	 10SRE: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (10akosiaris)
[10:18:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T329260)', diff saved to https://phabricator.wikimedia.org/P45851 and previous config saved to /var/cache/conftool/dbconfig/20230314-101840-marostegui.json
[10:18:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[10:18:48] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[10:19:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[10:19:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[10:19:12] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[10:19:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T329260)', diff saved to https://phabricator.wikimedia.org/P45852 and previous config saved to /var/cache/conftool/dbconfig/20230314-101918-marostegui.json
[10:19:54] <wikibugs>	 10SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris)
[10:20:18] <wikibugs>	 10SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris)
[10:20:21] <wikibugs>	 (03PS5) 10Elukey: api-gateway: allow to configure prefixes without JWT requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547)
[10:20:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris)
[10:20:25] <wikibugs>	 (03PS2) 10Elukey: services: allow anon traffic for liftwing's paths in API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/897844 (https://phabricator.wikimedia.org/T331547)
[10:20:27] <wikibugs>	 (03PS1) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759)
[10:20:59] <wikibugs>	 (03PS2) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759)
[10:21:35] <jbond>	 !log move pki.discovery.wmnet to pki2002 (bullseye)
[10:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:07] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] [beta] GrowthExperiments: Short leveling up notification delay for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898246 (https://phabricator.wikimedia.org/T330358) (owner: 10Gergő Tisza)
[10:23:11] <wikibugs>	 10SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris) p:05Triage→03Medium
[10:23:39] <wikibugs>	 10SRE, 10Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10akosiaris) p:05Triage→03Medium
[10:25:08] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[10:25:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki: move services to pki2002 [dns] - 10https://gerrit.wikimedia.org/r/898693 (owner: 10Jbond)
[10:28:19] <claime>	 !log Running sre.switchdc.mediawiki.00-optional-warmup-caches - T331541
[10:28:23] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches
[10:28:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:25] <stashbot>	 T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541
[10:28:33] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=99)
[10:28:38] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches
[10:28:56] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet on all recursors
[10:28:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:29:00] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet on all recursors
[10:29:15] <icinga-wm>	 PROBLEM - Host centrallog1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:29:49] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:51] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[10:30:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10SLyngshede-WMF) Attributes we will be needing:    -     wikimediaGlobalAccountId (MediaWiki SUL account) (optional)   -     wikimediaGlobalAccountName (MediaWiki SUL account...
[10:32:27] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=0)
[10:32:51] <claime>	 !log Repooling all active/active services in eqiad - T331541
[10:32:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:01] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[10:33:13] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541
[10:33:19] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 started.
[10:33:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:37:35] <icinga-wm>	 PROBLEM - DPKG on stat1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:38:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T329260)', diff saved to https://phabricator.wikimedia.org/P45853 and previous config saved to /var/cache/conftool/dbconfig/20230314-103813-marostegui.json
[10:38:19] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[10:39:52] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey)
[10:42:52] <jbond>	 !log reimage pki-root1001
[10:42:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:40] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host pki-root1001.eqiad.wmnet with OS bullseye
[10:47:52] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541
[10:47:57] <stashbot>	 T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541
[10:47:58] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 comple...
[10:48:12] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool restbase-async in codfw: T331541
[10:48:12] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in codfw: T331541
[10:48:59] <icinga-wm>	 ACKNOWLEDGEMENT - dump of m2 in eqiad on backupmon1001 is CRITICAL: dump for m2 at eqiad (db1117) taken more than a week ago: Most recent backup 2023-02-28 03:17:30 Marostegui Waiting for the retry https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:49:17] <icinga-wm>	 RECOVERY - dump of m2 in eqiad on backupmon1001 is OK: Last dump for m2 at eqiad (db1117) taken on 2023-03-14 03:23:39 (550 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:49:31] <wikibugs>	 (03CR) 10Volans: [C: 03+2] docstrings: automatically document type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 (owner: 10Volans)
[10:53:19] <wikibugs>	 (03Merged) 10jenkins-bot: docstrings: automatically document type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 (owner: 10Volans)
[10:53:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45854 and previous config saved to /var/cache/conftool/dbconfig/20230314-105319-marostegui.json
[10:53:21] <wikibugs>	 (03PS1) 10Elukey: ml-services: tune autoscaling for ORES model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763)
[10:53:53] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "lgtm, minor query" [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey)
[10:58:48] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pki-root1001.eqiad.wmnet with reason: host reimage
[10:59:20] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "Very neat!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey)
[10:59:31] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] services: allow anon traffic for liftwing's paths in API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/897844 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey)
[10:59:33] <wikibugs>	 (03CR) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey)
[10:59:37] <TheresNoTime>	 claime: can you let me know when I'm okay to do a quick deploy?
[10:59:59] <claime>	 TheresNoTime: We're experiencing some weirdness in pooling status rn, will tell you when we're ok
[11:00:08] <TheresNoTime>	 ack, good luck! :)
[11:00:28] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] "These look fine for now!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey)
[11:02:44] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki-root1001.eqiad.wmnet with reason: host reimage
[11:02:44] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache api-ro.discovery.wmnet on all recursors
[11:02:47] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) api-ro.discovery.wmnet on all recursors
[11:02:54] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10MoritzMuehlenhoff) >>! In T331647#8690861, @Ottomata wrote: > Yes, and that's okay.   >  > The group Hal should be in then is [[ https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/...
[11:03:55] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
[11:03:58] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
[11:06:55] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] ml-services: tune autoscaling for ORES model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey)
[11:07:36] <icinga-wm>	 RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:08:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45855 and previous config saved to /var/cache/conftool/dbconfig/20230314-110826-marostegui.json
[11:08:37] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[11:11:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:12:07] <wikibugs>	 (03PS1) 10Urbanecm: arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973)
[11:12:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:12:46] <urbanecm>	 TheresNoTime: once you're given the green light to deploy, would you mind taking ^^ with you?
[11:12:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:13:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) (owner: 10Urbanecm)
[11:13:05] <TheresNoTime>	 urbanecm: 898700? Sure :)
[11:13:08] <urbanecm>	 yup
[11:13:13] * urbanecm goes to fix CI in the meantime
[11:13:18] <urbanecm>	 (on that patch)
[11:13:23] <urbanecm>	 ty
[11:13:37] <claime>	 !log We are encountering unexpected DNS anycast issues following T331541; latencies are increased but there is no production outage.
[11:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:42] <stashbot>	 T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541
[11:13:58] <wikibugs>	 (03PS2) 10Urbanecm: arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973)
[11:16:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: add pint for thanos-rule [puppet] - 10https://gerrit.wikimedia.org/r/898701 (https://phabricator.wikimedia.org/T309182)
[11:17:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:19:23] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache api-ro.discovery.wmnet on all recursors
[11:19:27] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) api-ro.discovery.wmnet on all recursors
[11:20:41] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] api-gateway: allow to configure prefixes without JWT requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey)
[11:21:04] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: allow anon traffic for liftwing's paths in API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/897844 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey)
[11:21:46] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Add blackbox http check for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/897917 (https://phabricator.wikimedia.org/T327978) (owner: 10EoghanGaffney)
[11:22:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40102/console" [puppet] - 10https://gerrit.wikimedia.org/r/898701 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[11:23:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T329260)', diff saved to https://phabricator.wikimedia.org/P45856 and previous config saved to /var/cache/conftool/dbconfig/20230314-112333-marostegui.json
[11:23:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[11:23:39] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[11:23:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[11:23:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T329260)', diff saved to https://phabricator.wikimedia.org/P45857 and previous config saved to /var/cache/conftool/dbconfig/20230314-112354-marostegui.json
[11:27:24] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[11:27:47] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[11:29:18] <jinxer-wm>	 (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[11:34:18] <jinxer-wm>	 (MediaWikiMemcachedHighErrorRate) resolved: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[11:35:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:38:09] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[11:38:17] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[11:38:30] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[11:39:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[11:39:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:40:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:41:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:41:42] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[11:42:01] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert)
[11:42:02] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[11:42:19] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert) p:05Triage→03High
[11:42:21] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[11:43:58] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[11:46:21] <wikibugs>	 (03PS3) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759)
[11:46:35] <wikibugs>	 (03CR) 10Elukey: services: set custom rate limits for Lift Wing on the API Gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey)
[11:49:08] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Ottomata) Hm, that group (as well as analytics-research-admins) gives some sudo rights to a system user (analytics-platform-eng) that does have analytics-privatedata-users access, so I think it does require...
[11:49:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T329260)', diff saved to https://phabricator.wikimedia.org/P45860 and previous config saved to /var/cache/conftool/dbconfig/20230314-114957-marostegui.json
[11:50:03] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[11:51:35] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool appservers-ro in eqiad: T331541
[11:51:40] <stashbot>	 T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541
[11:51:58] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache appservers-ro.discovery.wmnet on all recursors
[11:52:01] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) appservers-ro.discovery.wmnet on all recursors
[11:52:21] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool appservers-ro in eqiad: T331541
[11:58:48] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey)
[12:01:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: set custom rate limits for Lift Wing on the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898695 (https://phabricator.wikimedia.org/T325759) (owner: 10Elukey)
[12:03:42] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[12:03:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[12:05:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45861 and previous config saved to /var/cache/conftool/dbconfig/20230314-120503-marostegui.json
[12:05:14] <wikibugs>	 10ops-eqiad, 10cloud-services-team (Hardware): cloudcontrol1007: power supply temperature critical - https://phabricator.wikimedia.org/T331984 (10aborrero)
[12:05:41] <wikibugs>	 10ops-eqiad, 10cloud-services-team (Hardware): cloudcontrol1007: power supply temperature critical - https://phabricator.wikimedia.org/T331984 (10aborrero) p:05Triage→03Medium
[12:06:53] <claime>	 !log Unlocked scap deployments - T331541
[12:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:58] <stashbot>	 T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541
[12:07:03] <claime>	 TheresNoTime, urbanecm 
[12:07:12] <TheresNoTime>	 claime: thank you :)
[12:07:15] <urbanecm>	 ty
[12:07:17] <claime>	 Go ahead, we're still having issues, but nothing that warrants blocking scap deployments
[12:08:14] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route pool appservers-ro in eqiad: T331541
[12:08:15] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache appservers-ro.discovery.wmnet on all recursors
[12:08:19] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) appservers-ro.discovery.wmnet on all recursors
[12:08:32] <wikibugs>	 10SRE, 10Scap, 10serviceops-collab, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10eoghan)
[12:08:42] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[12:08:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896224 (https://phabricator.wikimedia.org/T297396) (owner: 10Varnent)
[12:09:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896216 (https://phabricator.wikimedia.org/T331680) (owner: 10Varnent)
[12:09:28] <wikibugs>	 (03Merged) 10jenkins-bot: docroot: Update privacy policy footer link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896216 (https://phabricator.wikimedia.org/T331680) (owner: 10Varnent)
[12:09:31] <wikibugs>	 (03Merged) 10jenkins-bot: [foundationwiki] Grant translation admin rights to 'editor' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896224 (https://phabricator.wikimedia.org/T297396) (owner: 10Varnent)
[12:10:03] <wikibugs>	 (03PS1) 10Jbond: recursour: only forward to the local ns server [puppet] - 10https://gerrit.wikimedia.org/r/898704
[12:11:29] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:896224|[foundationwiki] Grant translation admin rights to 'editor' group (T297396)]], [[gerrit:896216|docroot: Update privacy policy footer link (T331680)]]
[12:11:35] <stashbot>	 T297396: Expand Governance Wiki Editor user group rights to include translate admin rights - https://phabricator.wikimedia.org/T297396
[12:11:35] <stashbot>	 T331680: Update footer links - https://phabricator.wikimedia.org/T331680
[12:13:12] <logmsgbot>	 !log samtar@deploy2002 samtar and varnent: Backport for [[gerrit:896224|[foundationwiki] Grant translation admin rights to 'editor' group (T297396)]], [[gerrit:896216|docroot: Update privacy policy footer link (T331680)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[12:13:14] <TheresNoTime>	 (testing)
[12:13:18] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool appservers-ro in eqiad: T331541
[12:13:23] <stashbot>	 T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541
[12:13:53] <TheresNoTime>	 (syncing)
[12:13:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:14:43] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10RobH) Since this is out of warranty, the pending purchase of 5 disks was raised to 7 on T331988 to accommodate this repair.
[12:15:30] <wikibugs>	 10SRE, 10serviceops: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10LSobanski)
[12:15:42] <TheresNoTime>	 I have a failure in scap
[12:15:55] <TheresNoTime>	 `Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.` - https://phabricator.wikimedia.org/P45862
[12:16:17] <TheresNoTime>	 (scap is rolling back)
[12:16:40] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert)
[12:18:00] <wikibugs>	 (03CR) 10Kosta Harlan: "Is there anything that needs to happen to deploy this change? Does that happen automatically?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan)
[12:18:54] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host pki-root1001.eqiad.wmnet with OS bullseye
[12:18:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:19:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:20:05] <TheresNoTime>	 !log `Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.` (P45862) during scap deployment of T297396 + T331680 — scap rolled back
[12:20:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45863 and previous config saved to /var/cache/conftool/dbconfig/20230314-122009-marostegui.json
[12:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:11] <stashbot>	 T297396: Expand Governance Wiki Editor user group rights to include translate admin rights - https://phabricator.wikimedia.org/T297396
[12:20:11] <stashbot>	 T331680: Update footer links - https://phabricator.wikimedia.org/T331680
[12:20:41] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:896224|[foundationwiki] Grant translation admin rights to 'editor' group (T297396)]], [[gerrit:896216|docroot: Update privacy policy footer link (T331680)]] (duration: 09m 12s)
[12:21:23] <TheresNoTime>	 ... okay but those changes were actually sync'd.. 
[12:21:37] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Marostegui) I just wanted to mention that despite the sudden spike on DB reads, our databases kept up just fine in general. We did have timeouts on some enwiki (s1) replicas...
[12:21:45] <wikibugs>	 (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[12:23:47] <moritzm>	 !log installing git security updates
[12:23:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:24:19] <wikibugs>	 (03PS4) 10Samtar: Enable VE on more namespaces on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders)
[12:24:22] <wikibugs>	 (03PS3) 10Samtar: arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) (owner: 10Urbanecm)
[12:27:32] <wikibugs>	 (03PS8) 10Slyngshede: P:openldap Extend wmf-user schema with global account. [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048)
[12:27:55] <wikibugs>	 10SRE, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10Johan)
[12:28:10] <wikibugs>	 10SRE, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10Johan)
[12:29:15] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: bump workers, reduce cpu, increase haproxy queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/898728 (https://phabricator.wikimedia.org/T328033)
[12:31:42] <wikibugs>	 (03PS9) 10Slyngshede: P:openldap Extend wmf-user schema with global account. [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048)
[12:35:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T329260)', diff saved to https://phabricator.wikimedia.org/P45864 and previous config saved to /var/cache/conftool/dbconfig/20230314-123515-marostegui.json
[12:35:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance
[12:35:21] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[12:35:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance
[12:36:47] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992)
[12:37:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "To configure BIRD we need to know which IP address we will be using." [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[12:39:39] <wikibugs>	 (03PS41) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123)
[12:39:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:41:45] <wikibugs>	 (03PS1) 10Hokwelum: Add new mirror details to list of dump mirrors [puppet] - 10https://gerrit.wikimedia.org/r/898729
[12:41:59] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudlb: give them proper role [puppet] - 10https://gerrit.wikimedia.org/r/898730 (https://phabricator.wikimedia.org/T324992)
[12:43:14] <wikibugs>	 (03PS2) 10Hokwelum: Add new mirror details to list of dump mirrors [puppet] - 10https://gerrit.wikimedia.org/r/898729
[12:43:20] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40104/console" [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[12:43:43] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye
[12:44:31] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2003-dev.codfw.wmnet with OS bullseye
[12:44:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:45:06] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: give them proper role [puppet] - 10https://gerrit.wikimedia.org/r/898730 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[12:45:41] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudlb: give them proper role [puppet] - 10https://gerrit.wikimedia.org/r/898730 (https://phabricator.wikimedia.org/T324992)
[12:48:38] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:53:19] <wikibugs>	 (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[12:53:43] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] Add new mirror details to list of dump mirrors [puppet] - 10https://gerrit.wikimedia.org/r/898729 (owner: 10Hokwelum)
[12:54:03] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 03+1] Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[12:54:57] <wikibugs>	 (03PS1) 10Clément Goubert: sre.discovery.service-route: Reduce TTL before changes [cookbooks] - 10https://gerrit.wikimedia.org/r/898731
[12:54:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:55:17] <wikibugs>	 (03CR) 10Ayounsi: "Some comments, then indeed next step is to define a VIP pool." [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[12:55:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede)
[12:56:08] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 04-1] Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[12:57:34] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[12:58:12] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage
[12:58:33] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage
[12:59:22] <wikibugs>	 (03PS42) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123)
[12:59:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1300).
[13:00:05] <jouncebot>	 TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1300)
[13:00:05] <jouncebot>	 xSavitar and raynor: A patch you scheduled for Mobileapps/RESTBase/Wikifeeds is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:25] * Lucas_WMDE is mostly afk
[13:00:25] <TheresNoTime>	 I can (self-)deploy!
[13:00:53] <wikibugs>	 (03CR) 10Btullis: Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[13:01:17] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:01:27] <XioNoX>	 s'up
[13:01:43] <TheresNoTime>	 (I am holding deployment per ^)
[13:02:09] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage
[13:04:00] <XioNoX>	 can someone ack the page? my splunk app is not cooperating...
[13:04:07] <sobanski>	 On it
[13:04:17] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage
[13:04:35] <sobanski>	 Acked
[13:05:04] <volans>	 XioNoX: you can use sirenbot
[13:05:06] <effie>	 sobanski: you are fast 
[13:05:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[13:05:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[13:06:17] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:07:30] <claime>	 XioNoX: parsoid is having a worker starvation issue, but it doesn't seem related to an increase in queries
[13:08:12] * urandom is slow
[13:08:22] <XioNoX>	 fyi, the link on that page returns a "Panel not found"
[13:08:47] <claime>	 XioNoX: It works for me, at least the grafana link
[13:08:48] <sobanski>	 The Grafana one? It worked for me
[13:09:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders)
[13:09:03] <XioNoX>	 uh?
[13:09:16] <XioNoX>	 which panel does it link to?
[13:09:27] <XioNoX>	 I only get to the dashboard
[13:09:31] <sobanski>	 Ah, there is a short-lived warning at the top of the page
[13:09:34] <claime>	 It links to the dashboard
[13:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: Enable VE on more namespaces on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders)
[13:09:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:10:10] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:894094|Enable VE on more namespaces on foundationwiki (T331079)]]
[13:10:11] <claime>	 But it should link to https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=1678788598368&orgId=1&to=1678799398368&var-cluster=parsoid&var-datasource=codfw+prometheus%2Fops&viewPanel=64
[13:10:12] <sobanski>	 panelid=54
[13:10:16] <stashbot>	 T331079: Enable VisualEditor on all main namespaces on foundation.wikimedia.org - https://phabricator.wikimedia.org/T331079
[13:10:31] <claime>	 Yeah, sobanski, panelid is off by 10
[13:11:45] <logmsgbot>	 !log samtar@deploy2002 esanders and samtar: Backport for [[gerrit:894094|Enable VE on more namespaces on foundationwiki (T331079)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[13:11:56] <TheresNoTime>	 (testing)
[13:12:18] <TheresNoTime>	 (syncing)
[13:13:03] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:26] <claime>	 I think the grafana url is borked completely
[13:14:39] <claime>	 It at least links to the right dash, but not to the fullscreen panel
[13:14:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:15:26] <wikibugs>	 (03PS1) 10BBlack: recdns: add a permissive ecs-add-for for new pdns [puppet] - 10https://gerrit.wikimedia.org/r/898736
[13:15:45] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10Papaul) @MatthewVernon you're right, I didn't read disk 6. I will see if I can find disks from old ms-be*
[13:16:47] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/898736 (owner: 10BBlack)
[13:17:13] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] recdns: add a permissive ecs-add-for for new pdns [puppet] - 10https://gerrit.wikimedia.org/r/898736 (owner: 10BBlack)
[13:17:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "could be more restrictive but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/898736 (owner: 10BBlack)
[13:18:05] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:894094|Enable VE on more namespaces on foundationwiki (T331079)]] (duration: 07m 55s)
[13:18:11] <stashbot>	 T331079: Enable VisualEditor on all main namespaces on foundation.wikimedia.org - https://phabricator.wikimedia.org/T331079
[13:18:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) (owner: 10Urbanecm)
[13:18:37] <bblack>	 !log rolling out recdns fixup for missing 10/8 ECS affecting local inter-dc discovery/geoip results
[13:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:01] <wikibugs>	 (03Merged) 10jenkins-bot: arwiki: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898700 (https://phabricator.wikimedia.org/T331973) (owner: 10Urbanecm)
[13:19:22] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:898700|arwiki: Add new throttle rule (T331973)]]
[13:19:27] <stashbot>	 T331973: Temporary lift IP cap for Wiki workshop at Birzeit University on 15-18 March 2023 - https://phabricator.wikimedia.org/T331973
[13:20:55] <logmsgbot>	 !log samtar@deploy2002 samtar and urbanecm: Backport for [[gerrit:898700|arwiki: Add new throttle rule (T331973)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:20:59] <TheresNoTime>	 (syncing)
[13:21:27] <wikibugs>	 (03PS1) 10Clément Goubert: team-sre/mediawiki: Fix php-fpm alert dashboard links [alerts] - 10https://gerrit.wikimedia.org/r/898737
[13:22:35] <claime>	 XioNoX: sobanski ^
[13:23:25] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:861:1:208:80:154:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:24:31] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.154.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:24:42] <sukhe>	 er
[13:24:43] <bblack>	 ^ that's unsettling!
[13:24:46] <sukhe>	 what's this about
[13:24:49] <sukhe>	 looking
[13:24:58] <bblack>	 but it could be a false alert, maybe the check depends on wrong ECS behavior or whatever
[13:24:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:25:01] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.155.108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:25:09] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:25:13] <bblack>	 stopped the rollout though, in case
[13:25:22] <sukhe>	 thanks looking
[13:25:41] <wikibugs>	 (03CR) 10LSobanski: [C: 03+1] team-sre/mediawiki: Fix php-fpm alert dashboard links [alerts] - 10https://gerrit.wikimedia.org/r/898737 (owner: 10Clément Goubert)
[13:25:43] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:861:4:208:80:155:108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:25:49] <sukhe>	 Mar 14 13:25:42 dns1002 pdns-recursor[3751219]: Mar 14 13:25:42 Exception: Trying to set unknown setting 'ecs-add-for: 0.0.0.0/0, ::/>
[13:26:02] <sukhe>	 pdns-rec fails, hence anycast-hc fails, hence bird fails
[13:26:10] * claime curses dns
[13:26:15] <bblack>	 that's how it should work
[13:26:27] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:26:29] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] team-sre/mediawiki: Fix php-fpm alert dashboard links [alerts] - 10https://gerrit.wikimedia.org/r/898737 (owner: 10Clément Goubert)
[13:26:29] <bblack>	 it's failing over service to another DC at this point, since we hit all 3x eqiad now
[13:26:33] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns1003 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:26:45] <wikibugs>	 (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762) (owner: 10Anzx)
[13:26:47] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:898700|arwiki: Add new throttle rule (T331973)]] (duration: 07m 24s)
[13:26:53] <wikibugs>	 (03PS1) 10BBlack: Revert "recdns: add a permissive ecs-add-for for new pdns" [puppet] - 10https://gerrit.wikimedia.org/r/898720
[13:26:54] <stashbot>	 T331973: Temporary lift IP cap for Wiki workshop at Birzeit University on 15-18 March 2023 - https://phabricator.wikimedia.org/T331973
[13:27:06] <wikibugs>	 (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "recdns: add a permissive ecs-add-for for new pdns" [puppet] - 10https://gerrit.wikimedia.org/r/898720 (owner: 10BBlack)
[13:27:09] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:27:15] <TheresNoTime>	 !log close UTC afternoon backport window
[13:27:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:27] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns2001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:27:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove FlaggedRevs for ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762) (owner: 10Anzx)
[13:27:35] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.154.134 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:27:35] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:27:42] <wikibugs>	 (03Merged) 10jenkins-bot: team-sre/mediawiki: Fix php-fpm alert dashboard links [alerts] - 10https://gerrit.wikimedia.org/r/898737 (owner: 10Clément Goubert)
[13:27:43] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:861:2:208:80:154:134 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:27:45] <wikibugs>	 (03PS4) 10Zoranzoki21: Remove meaningless restriction level "none" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 (owner: 10Thiemo Kreuz (WMDE))
[13:27:49] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:27:49] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:860:3:208:80:153:77 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:27:59] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:28:03] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:28:25] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.153.77 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:28:29] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:28:31] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:861:1:208:80:154:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[13:28:43] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:28:49] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:29:15] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns2001 is OK: OK: UP (pid=2866885) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:29:15] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:861:4:208:80:155:108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[13:29:21] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.154.134 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[13:29:25] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns1003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:29:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:29:48] <wikibugs>	 (03PS1) 10Ssingh: recdns: add a permissive ecs-add-for for new pdns [puppet] - 10https://gerrit.wikimedia.org/r/898738
[13:30:10] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:30:18] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:30:37] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] "Syntax looks correct-er! :)" [puppet] - 10https://gerrit.wikimedia.org/r/898738 (owner: 10Ssingh)
[13:30:53] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] recdns: add a permissive ecs-add-for for new pdns [puppet] - 10https://gerrit.wikimedia.org/r/898738 (owner: 10Ssingh)
[13:31:18] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 182, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:31:34] <wikibugs>	 (03PS4) 10Anzx: Remove FlaggedRevs for ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762)
[13:32:52] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:33:20] <bblack>	 !log rolling out recdns fixup for missing 10/8 ECS affecting local inter-dc discovery/geoip results (again, with sukhe's more-correct variant!)
[13:33:21] <wikibugs>	 (03CR) 10Btullis: "It looks fine in general. Two queries inline and once again I'd recommend getting Janis' opinion. The -1 is for the networkpolicy and the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison)
[13:33:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:32] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10ayounsi) > Unless we want to replace both at that stage? Probably not  > Ideally, longer-term, it would be nice to have both racks fairly symmet...
[13:33:40] <wikibugs>	 (03CR) 10Btullis: [C: 04-1] spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison)
[13:34:22] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns1002 is OK: OK: UP (pid=3758841) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:34:25] <wikibugs>	 (03Abandoned) 10Jbond: recursour: only forward to the local ns server [puppet] - 10https://gerrit.wikimedia.org/r/898704 (owner: 10Jbond)
[13:34:35] <wikibugs>	 (03CR) 10Samtar: "Did you use `composer manage-dblist` to remove `ptwikisource` from the `flaggedrevs` dblist?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762) (owner: 10Anzx)
[13:34:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:34:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10ayounsi) 05Open→03Declined Closing this task. I'll reopen if there is anything useful that comes out of the conversation. Nothing too interesting for us in pmacct changelog n...
[13:35:34] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[13:37:26] <wikibugs>	 (03Merged) 10jenkins-bot: Modify policy to use in aggregate for 185.15.57.0/24 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/898678 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[13:38:32] <wikibugs>	 (03PS1) 10Elukey: services: add staging config for Lift Wing to the API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/898741
[13:38:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:41:06] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:42:26] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:42] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns1003 is OK: OK: UP (pid=120398) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:43:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:47:07] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "One small comment and LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[13:48:04] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.153.77 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[13:49:38] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.154.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[13:49:48] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:51:11] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:51:20] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:52:23] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] paws/NFS: move paws to a project-local NFS server [puppet] - 10https://gerrit.wikimedia.org/r/896353 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott)
[13:52:44] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.155.108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[13:52:58] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:54:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:55:48] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:860:3:208:80:153:77 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[13:58:16] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[13:58:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)   - Re...
[13:58:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[13:58:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[13:58:51] <wikibugs>	 (03PS1) 10Filippo Giunchedi: netops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898746 (https://phabricator.wikimedia.org/T309182)
[13:58:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: structured-data: address warnings [alerts] - 10https://gerrit.wikimedia.org/r/898747
[13:59:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:59:57] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:861:2:208:80:154:134 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[14:00:35] <wikibugs>	 (03PS1) 10Samtar: InitialiseSettings.php: Undeploy Phonos from afwiktionary, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898749 (https://phabricator.wikimedia.org/T332006)
[14:00:48] <jbond>	 !log reimage pki1001
[14:00:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:33] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host pki1001.eqiad.wmnet with OS bullseye
[14:01:35] <wikibugs>	 (03PS10) 10Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[14:02:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: "test_alerts.py::test_lint_rule[team-structured-data/data_pipelines.yaml]" [alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi)
[14:02:35] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/898750
[14:05:16] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[14:05:20] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] netops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898746 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[14:05:23] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro)
[14:07:29] <wikibugs>	 (03PS1) 10Filippo Giunchedi: dcops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898753 (https://phabricator.wikimedia.org/T309182)
[14:07:31] <wikibugs>	 (03PS1) 10Filippo Giunchedi: perf: deploy to 'ext' instance [alerts] - 10https://gerrit.wikimedia.org/r/898754 (https://phabricator.wikimedia.org/T309182)
[14:08:01] <wikibugs>	 (03CR) 10Cparle: "This looks fine to me Filippo, just wondering if there's any way I can test it to make sure I get a warning when I ought to?" [alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi)
[14:08:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] netops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898746 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[14:08:46] <wikibugs>	 (03CR) 10Ayounsi: Refactor and centralize BGPpeer config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[14:09:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Decrease db2122 weight', diff saved to https://phabricator.wikimedia.org/P45866 and previous config saved to /var/cache/conftool/dbconfig/20230314-140926-root.json
[14:09:27] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert)
[14:09:32] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert)
[14:09:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: structured-data: address warnings (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi)
[14:09:50] <wikibugs>	 (03PS1) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755
[14:12:15] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cfssl in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:12:25] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:12:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:13:05] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert)
[14:13:16] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert)
[14:13:22] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert)
[14:13:37] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert)
[14:14:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert)
[14:14:12] <wikibugs>	 (03CR) 10Cparle: [C: 03+2] structured-data: address warnings [alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi)
[14:14:26] <wikibugs>	 (03PS2) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755
[14:14:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Clement_Goubert) p:05Triage→03Medium
[14:14:50] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert)
[14:14:57] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:11] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:15:37] <wikibugs>	 (03Merged) 10jenkins-bot: structured-data: address warnings [alerts] - 10https://gerrit.wikimedia.org/r/898747 (owner: 10Filippo Giunchedi)
[14:15:42] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert) p:05High→03Medium
[14:15:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] dcops: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898753 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[14:16:06] <wikibugs>	 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10akosiaris)
[14:16:19] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pki1001.eqiad.wmnet with reason: host reimage
[14:16:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris)
[14:16:32] <wikibugs>	 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10akosiaris)
[14:16:41] <claime>	 !log All active/active services in eqiad repooled, DNS issues resolved - T331541
[14:16:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:46] <stashbot>	 T331541: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541
[14:17:17] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[14:17:23] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.491 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:17:43] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:17:59] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:18:48] <wikibugs>	 10SRE, 10serviceops: Migrate dragonfly-supernodes to bullseye - https://phabricator.wikimedia.org/T332011 (10akosiaris)
[14:19:03] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki1001.eqiad.wmnet with reason: host reimage
[14:19:06] <wikibugs>	 10SRE, 10serviceops: Migrate dragonfly-supernodes to bullseye - https://phabricator.wikimedia.org/T332011 (10akosiaris)
[14:19:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris)
[14:19:48] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[14:20:33] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) 05In progress→03Resolved We ran into a powerdns configuration issue which meant that instead of traffic being spread over both datacenters, we completely switched...
[14:21:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris)
[14:21:43] <wikibugs>	 10SRE, 10serviceops: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris)
[14:21:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris)
[14:21:55] <wikibugs>	 (03PS2) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/898750
[14:22:01] <wikibugs>	 10SRE, 10serviceops: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10akosiaris)
[14:22:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris)
[14:23:30] <wikibugs>	 (03PS1) 10JMeybohm: cert-manager: Allow cfssl-issuer access to pki2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/898759 (https://phabricator.wikimedia.org/T331696)
[14:23:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris)
[14:24:09] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:25:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10akosiaris)
[14:27:38] <wikibugs>	 (03PS3) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/898750
[14:30:50] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] cert-manager: Allow cfssl-issuer access to pki2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/898759 (https://phabricator.wikimedia.org/T331696) (owner: 10JMeybohm)
[14:31:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/898750 (owner: 10Muehlenhoff)
[14:31:35] <wikibugs>	 (03PS1) 10Filippo Giunchedi: search-platform: deploy alerts to specific Prometheus instances [alerts] - 10https://gerrit.wikimedia.org/r/898765 (https://phabricator.wikimedia.org/T309182)
[14:32:15] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job cfssl in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:32:46] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/898766
[14:35:54] <wikibugs>	 (03Merged) 10jenkins-bot: cert-manager: Allow cfssl-issuer access to pki2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/898759 (https://phabricator.wikimedia.org/T331696) (owner: 10JMeybohm)
[14:37:17] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki1001.eqiad.wmnet with OS bullseye
[14:37:20] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: sre.discovery.service-route: Use DNS_TTL_SHORT [cookbooks] - 10https://gerrit.wikimedia.org/r/898767
[14:37:21] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:37:30] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:37:37] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[14:37:45] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[14:37:53] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:38:02] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:38:05] <wikibugs>	 (03PS2) 10Filippo Giunchedi: search-platform: deploy alerts to specific Prometheus instances [alerts] - 10https://gerrit.wikimedia.org/r/898765 (https://phabricator.wikimedia.org/T309182)
[14:38:31] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: sre.discovery.service-route: Use DNS_TTL_SHORT [cookbooks] - 10https://gerrit.wikimedia.org/r/898767
[14:40:22] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/898766 (owner: 10Jgiannelos)
[14:41:08] <wikibugs>	 (03PS1) 10Filippo Giunchedi: structured-data: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898768 (https://phabricator.wikimedia.org/T309182)
[14:42:09] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.renew-cert for pki1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin1001
[14:43:28] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for pki1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin1001
[14:43:36] <wikibugs>	 (03CR) 10Cparle: [C: 03+2] structured-data: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898768 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[14:43:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate PKI servers to Bullseye - https://phabricator.wikimedia.org/T331696 (10jbond)
[14:44:51] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/898766 (owner: 10Jgiannelos)
[14:44:53] <wikibugs>	 (03PS1) 10JMeybohm: cert-manager: Run cert-manager 1.10.1 in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/898770 (https://phabricator.wikimedia.org/T325292)
[14:45:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations: pki2001: decomission server - https://phabricator.wikimedia.org/T332018 (10jbond)
[14:47:58] <wikibugs>	 (03CR) 10Volans: "duplicate of Ia3917b61798b2b4e6fb0ff3676f19658f9565c72 ?" [cookbooks] - 10https://gerrit.wikimedia.org/r/898767 (owner: 10Alexandros Kosiaris)
[14:50:07] <wikibugs>	 (03CR) 10Volans: sre.discovery.service-route: Reduce TTL before changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/898731 (owner: 10Clément Goubert)
[14:50:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations: pki2001: decommission server - https://phabricator.wikimedia.org/T332018 (10Aklapper)
[14:51:21] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] cert-manager: Run cert-manager 1.10.1 in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/898770 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm)
[14:51:45] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:52:21] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:52:25] <wikibugs>	 (03CR) 10Clément Goubert: sre.discovery.service-route: Reduce TTL before changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/898731 (owner: 10Clément Goubert)
[14:52:33] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:53:43] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:53:48] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:53:54] <wikibugs>	 (03PS1) 10FNegri: [tbs.harbor] Fix wrong paths for Harbor certs [puppet] - 10https://gerrit.wikimedia.org/r/898773 (https://phabricator.wikimedia.org/T316323)
[14:54:53] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:55:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [tbs.harbor] Fix wrong paths for Harbor certs [puppet] - 10https://gerrit.wikimedia.org/r/898773 (https://phabricator.wikimedia.org/T316323) (owner: 10FNegri)
[14:56:25] <wikibugs>	 (03Merged) 10jenkins-bot: cert-manager: Run cert-manager 1.10.1 in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/898770 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm)
[14:56:33] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[14:58:11] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[14:58:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)   - Re...
[14:58:32] <wikibugs>	 (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/898731 (owner: 10Clément Goubert)
[14:59:37] <logmsgbot>	 !log jayme@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:00:12] <logmsgbot>	 !log jayme@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:00:16] <wikibugs>	 (03CR) 10Clément Goubert: sre.discovery.service-route: Reduce TTL before changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/898731 (owner: 10Clément Goubert)
[15:01:06] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: sre.discovery.service-route: Use DNS_TTL_SHORT [cookbooks] - 10https://gerrit.wikimedia.org/r/898767 (owner: 10Alexandros Kosiaris)
[15:02:43] <icinga-wm>	 PROBLEM - haproxy process on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[15:02:55] <icinga-wm>	 PROBLEM - Check systemd state on cloudlb2002-dev is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:02:59] <icinga-wm>	 PROBLEM - haproxy alive on cloudlb2002-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy
[15:04:41] <icinga-wm>	 PROBLEM - haproxy alive on cloudlb2003-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy
[15:05:03] <wikibugs>	 (03PS6) 10Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649)
[15:05:17] <icinga-wm>	 PROBLEM - haproxy process on cloudlb2003-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[15:05:21] <icinga-wm>	 PROBLEM - Check systemd state on cloudlb2003-dev is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:40] <wikibugs>	 (03PS1) 10Filippo Giunchedi: netops: split routinator from ping offload [alerts] - 10https://gerrit.wikimedia.org/r/898776 (https://phabricator.wikimedia.org/T309182)
[15:07:56] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[15:08:22] <wikibugs>	 (03PS1) 10JHathaway: kernel-purge: purge all at once and check return code [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011)
[15:09:25] <wikibugs>	 (03PS7) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[15:09:33] <wikibugs>	 (03PS7) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151)
[15:09:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:10:20] <wikibugs>	 (03PS2) 10Filippo Giunchedi: structured-data: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898768 (https://phabricator.wikimedia.org/T309182)
[15:10:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2] structured-data: deploy to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/898768 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[15:13:19] <icinga-wm>	 PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:16:00] <wikibugs>	 (03PS2) 10FNegri: [tbs.harbor] Fix wrong paths for Harbor certs [puppet] - 10https://gerrit.wikimedia.org/r/898773 (https://phabricator.wikimedia.org/T316323)
[15:16:41] <wikibugs>	 (03CR) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison)
[15:17:58] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway)
[15:18:02] <wikibugs>	 (03PS7) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[15:19:06] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging2003.codfw.wmnet with OS bullseye
[15:19:25] <wikibugs>	 (03PS8) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[15:19:32] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10Papaul) @MatthewVernon I replaced 6. Let me know which other one is having issues  `     Physical Disk 0:1:0 Online 0 3725.50 GB Not Capable SATA HDD No  Not Applicable     Phy...
[15:19:38] <wikibugs>	 (03Abandoned) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:19:48] <wikibugs>	 (03CR) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:19:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:20:53] <wikibugs>	 (03PS9) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[15:21:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:21:59] <wikibugs>	 10SRE, 10serviceops: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10MoritzMuehlenhoff) The backports are complete and support Unicode 13 now!   ` jmm@jmm-mw-icu67:~$ php -r "var_dump(IntlChar::getUnicodeVersion());" array(4) {   [0]=>   int(13)   [1]=>   int(0)   [2]=>   int(0)...
[15:23:48] <wikibugs>	 (03PS10) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[15:23:52] <wikibugs>	 (03PS2) 10JHathaway: kernel-purge: purge all at once and check return code [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011)
[15:24:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:25:20] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40109/console" [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway)
[15:28:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10decommission-hardware: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10jbond) p:05Triage→03Medium a:03jbond
[15:30:02] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10herron)
[15:30:15] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pki2001.codfw.wmnet with reason: decommission
[15:30:41] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pki2001.codfw.wmnet with reason: decommission
[15:30:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10decommission-hardware: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4697f9e6-30a6-446d-b67d-d99317a73ab5) set by jbond@cumin1001 for 5 days, 0:00:00 on 1 host(s) and thei...
[15:31:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10decommission-hardware: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10jbond)
[15:32:56] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2003.codfw.wmnet with reason: host reimage
[15:35:37] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[15:35:39] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[15:36:15] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2003.codfw.wmnet with reason: host reimage
[15:37:26] <wikibugs>	 (03PS1) 10Filippo Giunchedi: structured-data: deploy to ops/eqiad only [alerts] - 10https://gerrit.wikimedia.org/r/898783 (https://phabricator.wikimedia.org/T309182)
[15:38:00] <wikibugs>	 (03PS1) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040)
[15:38:54] <wikibugs>	 (03PS11) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[15:39:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[15:40:41] <wikibugs>	 (03PS1) 10Jbond: pki: move to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/898785 (https://phabricator.wikimedia.org/T332018)
[15:40:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:42:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki: move to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/898785 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond)
[15:42:39] <wikibugs>	 (03PS2) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040)
[15:43:25] <wikibugs>	 (03PS3) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040)
[15:46:10] <wikibugs>	 (03PS12) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[15:46:26] <wikibugs>	 (03PS13) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[15:48:52] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[15:49:12] <wikibugs>	 (03CR) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[15:50:41] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1 C: 03+2] kernel-purge: purge all at once and check return code [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway)
[15:51:25] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10Papaul) @MatthewVernon hey, I am about to go pick up the disk. I need you to ping me on the dc-ops channel so we can coordinate replacing both disks. Thanks
[15:52:40] <wikibugs>	 (03CR) 10Muehlenhoff: pki: move to spare::system (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898785 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond)
[15:53:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn)
[15:53:49] <wikibugs>	 (03PS1) 10Jbond: site.pp: move pki2001 to insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/898790 (https://phabricator.wikimedia.org/T332033)
[15:53:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:54:35] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] ml-services: tune autoscaling for ORES model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey)
[15:54:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki: move to spare::system (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898785 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond)
[15:55:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] site.pp: move pki2001 to insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/898790 (https://phabricator.wikimedia.org/T332033) (owner: 10Jbond)
[15:56:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "and probably didn't need this one either 😊" [puppet] - 10https://gerrit.wikimedia.org/r/898790 (https://phabricator.wikimedia.org/T332033) (owner: 10Jbond)
[15:58:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:59:28] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2003.codfw.wmnet with OS bullseye
[15:59:31] <icinga-wm>	 RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:00:02] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts pki2001.codfw.wmnet
[16:00:04] <jouncebot>	 jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:03:15] <wikibugs>	 (03PS1) 10Jelto: install_server: use second pair of disks for /srv/gitlab-backup [puppet] - 10https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172)
[16:03:51] <wikibugs>	 (03PS1) 10Jbond: pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018)
[16:04:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:04:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 12:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Bootstrapping ceph
[16:04:21] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 12:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Bootstrapping ceph
[16:04:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond)
[16:06:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] structured-data: deploy to ops/eqiad only [alerts] - 10https://gerrit.wikimedia.org/r/898783 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[16:06:04] <wikibugs>	 (03PS1) 10Herron: profile::kafka::broker::monitoring: remove under replicated icinga check [puppet] - 10https://gerrit.wikimedia.org/r/898793 (https://phabricator.wikimedia.org/T309010)
[16:06:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond)
[16:06:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10bd808) >>! In T320794#8690975, @SLyngshede-WMF wrote: > Attributes we will be needing: >  >   -     wikimediaGlobalAccountName (MediaWiki SUL account name) (optional)  Be aw...
[16:07:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond)
[16:08:27] <wikibugs>	 (03PS2) 10Herron: profile::kafka::broker::monitoring: remove under replicated icinga check [puppet] - 10https://gerrit.wikimedia.org/r/898793 (https://phabricator.wikimedia.org/T309010)
[16:09:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:10:15] <wikibugs>	 (03CR) 10Jelto: "Hi Moritz. Can you check the partman config? I want to move /srv/gitlab-backup to two new disks (raid 1). I removed the volume from lvm an" [puppet] - 10https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[16:10:20] <wikibugs>	 (03PS3) 10Herron: profile::kafka::broker::monitoring: remove under replicated icinga check [puppet] - 10https://gerrit.wikimedia.org/r/898793 (https://phabricator.wikimedia.org/T309010)
[16:10:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job cfssl in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:10:50] <wikibugs>	 (03Merged) 10jenkins-bot: pki2001: decomission server [deployment-charts] - 10https://gerrit.wikimedia.org/r/898792 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond)
[16:10:58] <wikibugs>	 (03PS1) 10Daniel Kinzler: Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534)
[16:11:37] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.netbox
[16:11:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler)
[16:13:24] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pki2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001"
[16:15:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10KSarabia-WMF)
[16:16:17] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pki2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001"
[16:16:17] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:16:17] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pki2001.codfw.wmnet
[16:16:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10decommission-hardware, 10Patch-For-Review: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `pki2001.codfw.wmnet` - pki2001.codfw.wmnet (**WARN**...
[16:19:38] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10KSarabia-WMF)
[16:19:58] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10Papaul) 05Open→03Resolved Both 21 and 23 replaced  `            Physical Disk 0:2:21  Online  21  7451.5 GB SATA  HDD  No      Physical Disk 0:2:22  Online  22  7451.5 GB SATA...
[16:20:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] install_server: use second pair of disks for /srv/gitlab-backup [puppet] - 10https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[16:20:10] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10KSarabia-WMF)
[16:20:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job cfssl in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:21:05] <wikibugs>	 (03CR) 10Herron: [C: 03+1] thanos: add pint for thanos-rule [puppet] - 10https://gerrit.wikimedia.org/r/898701 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[16:22:25] <wikibugs>	 (03PS2) 10Daniel Kinzler: Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534)
[16:23:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler)
[16:23:44] <wikibugs>	 (03PS1) 10Jbond: P:mariadb: drop pki2001 from grants [puppet] - 10https://gerrit.wikimedia.org/r/898800 (https://phabricator.wikimedia.org/T332018)
[16:23:48] <wikibugs>	 (03PS1) 10Jbond: pki2001: remove last refrences to pki2001 [puppet] - 10https://gerrit.wikimedia.org/r/898801 (https://phabricator.wikimedia.org/T332018)
[16:24:23] <icinga-wm>	 PROBLEM - Disk space on ms-be2067 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdy1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops
[16:24:24] <wikibugs>	 (03PS2) 10Jbond: pki2001: remove last refrences to pki2001 [puppet] - 10https://gerrit.wikimedia.org/r/898801 (https://phabricator.wikimedia.org/T332018)
[16:24:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki2001: remove last refrences to pki2001 [puppet] - 10https://gerrit.wikimedia.org/r/898801 (https://phabricator.wikimedia.org/T332018) (owner: 10Jbond)
[16:26:49] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations, 10decommission-hardware, 10Patch-For-Review: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (10jbond)
[16:27:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10NHillard-WMF) I approve this access as Kim's manager - thanks!
[16:28:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/898793 (https://phabricator.wikimedia.org/T309010) (owner: 10Herron)
[16:28:52] <wikibugs>	 (03PS1) 10Btullis: Fix an error with the ceph::server::firewall profile. [puppet] - 10https://gerrit.wikimedia.org/r/898803 (https://phabricator.wikimedia.org/T330149)
[16:29:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate PKI servers to Bullseye - https://phabricator.wikimedia.org/T331696 (10jbond) 05Open→03Resolved a:03jbond
[16:29:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) Most of the installer logic has been adapted for Bookworm, but there's one puzzling issue impacting the retrieval of our preseeded partitioning config.  In https://gi...
[16:30:05] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10KSarabia-WMF)
[16:30:11] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40110/console" [puppet] - 10https://gerrit.wikimedia.org/r/898803 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis)
[16:31:31] <wikibugs>	 (03PS3) 10Daniel Kinzler: Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534)
[16:34:44] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix an error with the ceph::server::firewall profile. [puppet] - 10https://gerrit.wikimedia.org/r/898803 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis)
[16:36:37] <icinga-wm>	 PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh1002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (stale by 1726017.11s). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[16:36:39] <icinga-wm>	 PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh6002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (stale by 1726522.19s). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[16:37:16] <sukhe>	 oh ha
[16:37:26] <sukhe>	 this check seemed like a weird idea but it's so handy
[16:38:01] <icinga-wm>	 PROBLEM - Host ms-be2067 is DOWN: PING CRITICAL - Packet loss = 100%
[16:39:07] <wikibugs>	 10SRE-OnFire, 10Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10akosiaris) Removing #SRE as the more specific working group is tagged already.
[16:40:19] <icinga-wm>	 RECOVERY - Host ms-be2067 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms
[16:42:35] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:43:59] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:44:21] <wikibugs>	 (03PS5) 10JMeybohm: Remove default-network-policy-conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035)
[16:44:33] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[16:44:35] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[16:44:38] <wikibugs>	 (CR) Effie Mouzeli: Assign mediawiki roles to mw2420-mw2451 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: Clément Goubert)
[16:44:47] <icinga-wm>	 RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops
[16:47:06] <sukhe>	 !log rolling restart of pdns-rec to pick up config changes
[16:47:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:15] <sukhe>	 !log rolling restart of pdns-rec in A:wikidough to pick up config changes
[16:47:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet
[16:50:02] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[16:51:25] <icinga-wm>	 PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh5001 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (stale by 1725395.80s). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[16:51:43] <sukhe>	 ^ should resolve shortly
[16:52:51] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: hieradata: acme_chief: authorize new cloudlb hosts to access cert [puppet] - 10https://gerrit.wikimedia.org/r/898808 (https://phabricator.wikimedia.org/T324992)
[16:53:38] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: acme_chief: authorize new cloudlb hosts to access cert [puppet] - 10https://gerrit.wikimedia.org/r/898808 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[16:54:38] <wikibugs>	 (03CR) 10Jbond: "fly by comment" [puppet] - 10https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway)
[16:55:27] <wikibugs>	 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10akosiaris) Remove #wikimedia-incident-actionable since I fail to find an action item that fits this projects description `Action items that came out of t...
[16:56:27] <wikibugs>	 (CR) JHathaway: [V: +1 C: +2] kernel-purge: purge all at once and check return code (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898779 (https://phabricator.wikimedia.org/T277011) (owner: JHathaway)
[16:57:13] <icinga-wm>	 RECOVERY - haproxy process on cloudlb2003-dev is OK: PROCS OK: 1 process with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[16:57:25] <icinga-wm>	 RECOVERY - haproxy process on cloudlb2002-dev is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[16:57:39] <icinga-wm>	 RECOVERY - Check systemd state on cloudlb2003-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:59] <icinga-wm>	 RECOVERY - Check systemd state on cloudlb2002-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:58:19] <wikibugs>	 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10akosiaris) Removing #SRE, adding the more specific SRE subteam that can probably drive this forward.
[16:59:06] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1700)
[17:01:29] <wikibugs>	 10SRE-OnFire, 10Traffic, 10Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (10akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:02:17] <icinga-wm>	 RECOVERY - haproxy alive on cloudlb2003-dev is OK: OK check_alive uptime 303s https://wikitech.wikimedia.org/wiki/HAProxy
[17:02:50] <wikibugs>	 10SRE, 10conftool, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10akosiaris) As a note, I am unsure which team to triage this to during sprint week.
[17:02:58] <wikibugs>	 (CR) JMeybohm: Refactor and centralize BGPpeer config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[17:03:13] <icinga-wm>	 RECOVERY - haproxy alive on cloudlb2002-dev is OK: OK check_alive uptime 413s https://wikitech.wikimedia.org/wiki/HAProxy
[17:05:12] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (MatthewVernon) Resolved→Open Hi @Papaul Sorry, but there's still a missing disk in this system - `Virtual Drive 21` is absent, which I think is `Slot Number: 19` Could you...
[17:05:43] <wikibugs>	 10SRE-OnFire, 10SRE Observability, 10Sustainability (Incident Followup): create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10akosiaris) Apparently #sre_observability team has it in their board, I am tagging with that and removing #SRE
[17:06:07] <wikibugs>	 (03PS7) 10Hnowlan: service, k8s: add service configuration for AQS2 service device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967)
[17:06:35] <wikibugs>	 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10MatthewVernon) [for reference, I was on leave on 2022-08-24]
[17:07:31] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40111/console" [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[17:07:33] <wikibugs>	 10SRE, 10Patch-For-Review, 10Sustainability (Incident Followup): Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10akosiaris) @dzahn anything left to do here? Would it be a good sprint week thing?
[17:07:41] <wikibugs>	 10SRE-OnFire, 10Observability-Logging, 10Sustainability (Incident Followup): create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10colewhite)
[17:08:00] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2002-dev.codfw.wmnet with OS bullseye
[17:09:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10MatthewVernon) ...this may be a different disk that's failed, so maybe it just got unseated?
[17:10:47] <wikibugs>	 (03Abandoned) 10Anzx: Remove FlaggedRevs for ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898719 (https://phabricator.wikimedia.org/T331762) (owner: 10Anzx)
[17:11:03] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2003-dev.codfw.wmnet with OS bullseye
[17:13:07] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10Sustainability (Incident Followup): Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10akosiaris)
[17:14:29] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q2), Sustainability (Incident Followup): Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (akosiaris) Open→Resolved @cdanis, No other actionables showed up, alerti...
[17:14:39] <wikibugs>	 (03PS1) 10Hnowlan: cassandra: add stub secret for device_analytics [labs/private] - 10https://gerrit.wikimedia.org/r/898810 (https://phabricator.wikimedia.org/T320967)
[17:15:22] <wikibugs>	 (03PS3) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755
[17:16:24] <wikibugs>	 (CR) Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[17:16:41] <wikibugs>	 (03PS1) 10Andrew Bogott: Trove: increase timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/898811
[17:16:59] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] cassandra: add stub secret for device_analytics [labs/private] - 10https://gerrit.wikimedia.org/r/898810 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[17:17:05] <wikibugs>	 10SRE, 10SRE-OnFire, 10wikitech.wikimedia.org, 10User-LSobanski: Incident response tools operational readiness review - https://phabricator.wikimedia.org/T290130 (10akosiaris) Removing #wikimedia-incident-actionable as it isn't clear by mention of task, incident doc, status doc, task or something similar h...
[17:17:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Trove: increase timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/898811 (owner: 10Andrew Bogott)
[17:18:37] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992)
[17:19:13] <wikibugs>	 (CR) Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird (6 comments) [puppet] - https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[17:20:19] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40113/console" [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[17:20:40] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10serviceops, 10Sustainability (Incident Followup): Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam (2 o...
[17:21:04] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992)
[17:21:34] <wikibugs>	 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10akosiaris) Removing #SRE, has already been triaged to a more...
[17:22:23] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40114/console" [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[17:23:54] <wikibugs>	 SRE, PyBal, Traffic-Icebox, Sustainability (Incident Followup): Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (akosiaris) Removing #SRE, triaging to #traffic-icebox since this is pybal and is pretty old.
[17:24:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: tune autoscaling for ORES model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/898697 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey)
[17:25:27] <wikibugs>	 (03PS1) 10CDanis: move jameel to sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/898814
[17:26:05] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10Structured Data Engineering, and 5 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:26:25] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] move jameel to sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/898814 (owner: 10CDanis)
[17:29:38] <wikibugs>	 ops-eqiad, SRE Observability (FY2022/2023-Q3): Decomission centrallog1001 - https://phabricator.wikimedia.org/T328803 (andrea.denisse) a:andrea.denisse→None
[17:29:58] <wikibugs>	 (03PS1) 10JHathaway: netboot: fix a few syntax errors [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495)
[17:30:02] <wikibugs>	 (03PS4) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755
[17:30:51] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos)
[17:31:32] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40117/console" [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[17:32:05] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos)
[17:32:19] <wikibugs>	 SRE, Mobile-Content-Service, Product-Infrastructure-Team-Backlog-Deprecated, Sustainability (Incident Followup), User-Joe: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551 (akosiaris) Open→Resolved a:akosiaris No u...
[17:32:25] <wikibugs>	 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537 (10akosiaris)
[17:32:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos)
[17:32:48] <wikibugs>	 (CR) David Caro: Improvements to maintain-dbusers and the rest-api (4 comments) [puppet] - https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[17:34:05] <wikibugs>	 SRE, observability, Epic, Release-Engineering-Team (Radar), Sustainability (Incident Followup): Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942 (akosiaris) Open→Resolved a:akosiaris No comments in 6 years, I am gonna tentatively r...
[17:35:31] <wikibugs>	 (03PS5) 10David Caro: wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755
[17:36:32] <wikibugs>	 (03PS2) 10JHathaway: netboot: fix a few syntax errors [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495)
[17:37:00] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40118/console" [puppet] - 10https://gerrit.wikimedia.org/r/898755 (owner: 10David Caro)
[17:37:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[17:37:16] <wikibugs>	 (CR) Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[17:37:35] <wikibugs>	 SRE, Release-Engineering-Team (Seen), Sustainability (Incident Followup): Review new service 'pre-deployment to production' checklist - https://phabricator.wikimedia.org/T141897 (akosiaris) Open→Resolved a:akosiaris No comments, updates or anything for that matter since the opening of thi...
[17:37:45] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] netboot: fix a few syntax errors [puppet] - 10https://gerrit.wikimedia.org/r/898815 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[17:37:49] <wikibugs>	 (03PS7) 10Hnowlan: helmfile: add device-analytics configuration, namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967)
[17:37:53] <wikibugs>	 (CR) Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[17:39:52] <wikibugs>	 10SRE-swift-storage, 10Commons: Commons files missing - https://phabricator.wikimedia.org/T332019 (10Aklapper)
[17:40:12] <wikibugs>	 10SRE-swift-storage, 10Commons: 404 error for image thumbnail file on Commons - https://phabricator.wikimedia.org/T332019 (10Aklapper)
[17:42:57] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136 (10akosiaris) Removing SRE, adding #data-persistence
[17:45:20] <wikibugs>	 SRE, Sustainability (Incident Followup): Detect high server load earlier – prometheus alert? - https://phabricator.wikimedia.org/T188317 (akosiaris) Open→Resolved a:akosiaris High server load isn't a good metric of anything, just a symptom/indication that something is wrong. We now have quite...
[17:45:52] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] helmfile: add device-analytics configuration, namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[17:49:40] <wikibugs>	 10SRE-swift-storage, 10Commons: 404 error for image thumbnail file on Commons - https://phabricator.wikimedia.org/T332019 (10Aklapper)
[17:49:55] <wikibugs>	 10SRE-swift-storage, 10Commons: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper)
[17:52:19] <wikibugs>	 10SRE-swift-storage, 10Commons: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper)
[17:52:22] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:52:23] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:52:54] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:52:55] <wikibugs>	 10SRE, 10serviceops, 10Sustainability (Incident Followup): docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10akosiaris)
[17:53:04] <wikibugs>	 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Volans) As an update, with the current situation even if a ho...
[17:53:04] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:53:09] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Radar), Sustainability (Incident Followup): create swift container-to-container synchronization metrics - https://phabricator.wikimedia.org/T229117 (akosiaris) Open→Declined I am gonna close this as Declined. Many years later I don't remember any...
[17:53:18] <wikibugs>	 10SRE, 10serviceops, 10Sustainability (Incident Followup): docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10akosiaris)
[17:53:32] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Radar), Sustainability (Incident Followup): create a docker_registry_codfw swift container backup - https://phabricator.wikimedia.org/T229118 (akosiaris) Open→Declined I am gonna close this as Declined. Many years later I don't remember any more...
[17:53:44] <wikibugs>	 (03Merged) 10jenkins-bot: helmfile: add device-analytics configuration, namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[17:55:28] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:55:42] <wikibugs>	 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136 (10Kappakayala)
[17:55:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:56:20] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[17:57:35] <wikibugs>	 (CR) David Caro: Improvements to maintain-dbusers and the rest-api (3 comments) [puppet] - https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[17:58:07] <wikibugs>	 (03PS4) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040)
[17:58:37] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[17:59:01] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[17:59:25] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[17:59:59] <wikibugs>	 (CR) CI reject: [V: -1] Improvements to maintain-dbusers and the rest-api [puppet] - https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[18:00:01] <wikibugs>	 (03PS5) 10David Caro: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[18:00:04] <jouncebot>	 brennen and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1800).
[18:01:55] <wikibugs>	 (CR) CI reject: [V: -1] Improvements to maintain-dbusers and the rest-api [puppet] - https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[18:01:58] <brennen>	 o/
[18:03:44] <brennen>	 !log 1.40.0-wmf.27 train (T330205): no current blockers, rolling to group0.
[18:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:49] <stashbot>	 T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205
[18:05:05] <wikibugs>	 (03PS6) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040)
[18:05:47] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898818 (https://phabricator.wikimedia.org/T330205)
[18:05:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898818 (https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot)
[18:06:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[18:06:08] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[18:06:46] <wikibugs>	 (03PS7) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040)
[18:06:49] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898818 (https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot)
[18:10:57] <wikibugs>	 (03PS1) 10Btullis: Fix the srange format for the ceph servers srange [puppet] - 10https://gerrit.wikimedia.org/r/898819 (https://phabricator.wikimedia.org/T330149)
[18:12:44] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40119/console" [puppet] - 10https://gerrit.wikimedia.org/r/898803 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis)
[18:13:08] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging2002.codfw.wmnet with OS bullseye
[18:13:34] <wikibugs>	 (03PS1) 10Hnowlan: device-analytics: add missing mesh port [deployment-charts] - 10https://gerrit.wikimedia.org/r/898820 (https://phabricator.wikimedia.org/T320967)
[18:13:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[18:13:48] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.27  refs T330205
[18:13:53] <stashbot>	 T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205
[18:13:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:15:28] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[18:18:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:22:25] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided)
[18:22:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10bd808) >>! In T320794#8693751, @SLyngshede-WMF wrote: > We actually have uid, cn and sn which are all by default set to the developer account username.   Possibly pedantic,...
[18:22:55] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 30s)
[18:23:33] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn)
[18:24:24] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn)
[18:25:34] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[18:25:42] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[18:25:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[18:27:09] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Fix the srange format for the ceph servers srange [puppet] - 10https://gerrit.wikimedia.org/r/898819 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis)
[18:27:33] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2002.codfw.wmnet with reason: host reimage
[18:27:54] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided)
[18:28:04] <wikibugs>	 (03PS1) 10Hnowlan: cassandra: fix device_analytics creation syntax [puppet] - 10https://gerrit.wikimedia.org/r/898824 (https://phabricator.wikimedia.org/T320967)
[18:28:06] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 11s)
[18:30:17] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] cassandra: fix device_analytics creation syntax [puppet] - 10https://gerrit.wikimedia.org/r/898824 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[18:32:29] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2002.codfw.wmnet with reason: host reimage
[18:42:42] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "Worth having a go" [deployment-charts] - 10https://gerrit.wikimedia.org/r/898728 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[18:43:57] <wikibugs>	 (03PS1) 10David Caro: replica_cnf: use correct paws userhomes path [puppet] - 10https://gerrit.wikimedia.org/r/898831
[18:44:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] replica_cnf: use correct paws userhomes path [puppet] - 10https://gerrit.wikimedia.org/r/898831 (owner: 10David Caro)
[18:51:39] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2002.codfw.wmnet with OS bullseye
[18:57:39] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[19:01:35] <wikibugs>	 10SRE, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh)
[19:02:22] <wikibugs>	 (03PS1) 10Krinkle: rdbms: Add db_log_category=performance to TransactionProfiler [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/898725
[19:02:25] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10ssingh) 05Open→03Resolved a:03ssingh Closing this in favour of T321309 where it is being tracked and also given that the Ganeti reimaging cookbook exists which was the prim...
[19:02:33] <wikibugs>	 10SRE, 10Traffic: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh)
[19:12:00] <MatmaRex>	 hello. anyone around who could check on a maintenance script run for me? i hope it has finished. https://phabricator.wikimedia.org/T315510#8656935
[19:13:05] <taavi>	 I'll have a look
[19:13:26] <taavi>	 MatmaRex: says 'Processed 2315500 (updated 1154395) of 7380439 rows'
[19:14:10] <MatmaRex>	 ugh okay. thanks
[19:14:19] <MatmaRex>	 isn't that going slower than before the switchover?
[19:14:35] <taavi>	 maybe?
[19:14:41] <taavi>	 it feels very slow either way
[19:14:51] <MatmaRex>	 hm, when are we switching back? will it finish in time? :P
[19:16:07] <taavi>	 hm. given I had to restart it once, maybe the 'processed X of Y' numbers are not accurate? since X would be the number of rows processed on this run but Y the total number of rows
[19:16:25] <MatmaRex>	 the total number is wrong
[19:16:55] <MatmaRex>	 but ladsgroup made me do it :D
[19:17:22] <MatmaRex>	 i'm not sure how many rows there are to really process. maybe 3M or so
[19:18:26] <MatmaRex>	 if we do end up restarting it, we should do it with the --start parameter as printed by the script
[19:20:09] <MatmaRex>	 oh, i guess you did it? i'm not sure what you meant in your message
[19:21:42] <wikibugs>	 10SRE-swift-storage, 10Commons: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Yann) Other people got the same error: https://commons.wikimedia.org/wiki/Commons:Village_pump#Files_are_not_appearing.
[19:21:58] <taavi>	 yes, I did it with --start. I just noted that even with --start the number of processed rows printed is processed since starting the script that time, not total, so the speed is not comparable with runs without a restart in the middle
[19:23:49] <MatmaRex>	 mhm
[19:24:06] <MatmaRex>	 thanks for doing that!
[19:25:06] <MatmaRex>	 i wonder if it would be okay to run these in parallel on several wikis
[19:30:40] <brennen>	 !log 1.40.0-wmf.27 train (T330205): uneventful at group0.  i'm afk for about an hour.
[19:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:45] <stashbot>	 T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205
[19:32:51] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[19:32:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[19:40:48] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Improvements to maintain-dbusers and the rest-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[19:41:05] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Improvements to maintain-dbusers and the rest-api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[19:44:30] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt with reason: cloudsw1-b1-codfw OS upgrade
[19:44:46] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt with reason: cloudsw1-b1-codfw OS upgrade
[19:44:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=484494a0-6cb6-44...
[19:47:17] <topranks>	 !log Reboot cloudsw1-b1-codfw to upgrade JunOS version T327919
[19:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:22] <stashbot>	 T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919
[19:51:00] <icinga-wm>	 PROBLEM - Host cloudlb2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[19:53:26] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:54:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 139, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:59:14] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable new Vector (2022) "Add topic" button at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898843 (https://phabricator.wikimedia.org/T331313)
[19:59:18] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools usability improvements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898844 (https://phabricator.wikimedia.org/T329407)
[19:59:20] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable new Vector (2022) "Add topic" button at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313)
[19:59:22] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools usability improvements at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407)
[20:00:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable new Vector (2022) "Add topic" button at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898843 (https://phabricator.wikimedia.org/T331313) (owner: 10Bartosz Dziewoński)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable DiscussionTools usability improvements at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407) (owner: 10Bartosz Dziewoński)
[20:00:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable new Vector (2022) "Add topic" button at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313) (owner: 10Bartosz Dziewoński)
[20:00:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable DiscussionTools usability improvements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898844 (https://phabricator.wikimedia.org/T329407) (owner: 10Bartosz Dziewoński)
[20:01:43] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper)
[20:02:20] <icinga-wm>	 RECOVERY - Host cloudlb2001-dev is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms
[20:02:27] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper)
[20:02:36] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:02:40] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper) p:05Triage→03Unbreak!
[20:02:56] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:03:42] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[20:03:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)...
[20:04:02] <wikibugs>	 (03PS8) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040)
[20:04:14] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[20:04:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[20:05:12] <wikibugs>	 (03CR) 10Raymond Ndibe: Improvements to maintain-dbusers and the rest-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[20:06:48] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Improvements to maintain-dbusers and the rest-api [puppet] - 10https://gerrit.wikimedia.org/r/898784 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[20:09:39] <wikibugs>	 (03PS1) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080)
[20:10:10] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] wmcs: move maintaindbusers to cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/898755 (owner: 10David Caro)
[20:10:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney)
[20:20:18] <wikibugs>	 (03PS2) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080)
[20:20:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney)
[20:24:37] <wikibugs>	 (03PS2) 10Zabe: dewiki: Allow 'crats to remove sysopship and manage importers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897997 (https://phabricator.wikimedia.org/T331921)
[20:24:52] <zabe>	 jouncebot: nowandnext
[20:24:52] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T2000)
[20:24:52] <jouncebot>	 In 9 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T0600)
[20:24:59] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] dewiki: Allow 'crats to remove sysopship and manage importers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897997 (https://phabricator.wikimedia.org/T331921) (owner: 10Zabe)
[20:25:12] <wikibugs>	 (03PS1) 10David Caro: replica_cnf: skip toolforge users without a home [puppet] - 10https://gerrit.wikimedia.org/r/898852
[20:26:11] <wikibugs>	 (03Merged) 10jenkins-bot: dewiki: Allow 'crats to remove sysopship and manage importers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897997 (https://phabricator.wikimedia.org/T331921) (owner: 10Zabe)
[20:27:03] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:897997|dewiki: Allow 'crats to remove sysopship and manage importers (T331921)]]
[20:27:09] <stashbot>	 T331921: enable de-wp bureaucrats to remove adminflag and to grant importer rights - https://phabricator.wikimedia.org/T331921
[20:28:20] <wikibugs>	 10SRE, 10Traffic: Cleanup and refactor the dnsrecursor module - https://phabricator.wikimedia.org/T332083 (10ssingh)
[20:28:41] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:897997|dewiki: Allow 'crats to remove sysopship and manage importers (T331921)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:28:47] <wikibugs>	 10SRE, 10Traffic: Clean up and refactor the dnsrecursor module - https://phabricator.wikimedia.org/T332083 (10ssingh)
[20:29:13] <wikibugs>	 10SRE, 10Traffic: Clean up and refactor the dnsrecursor module - https://phabricator.wikimedia.org/T332083 (10ssingh) p:05Triage→03Medium
[20:33:43] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Enable new Vector (2022) "Add topic" button at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898843 (https://phabricator.wikimedia.org/T331313)
[20:33:45] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools usability improvements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898844 (https://phabricator.wikimedia.org/T329407)
[20:33:47] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Enable new Vector (2022) "Add topic" button at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313)
[20:33:49] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools usability improvements at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407)
[20:35:39] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:897997|dewiki: Allow 'crats to remove sysopship and manage importers (T331921)]] (duration: 08m 36s)
[20:35:45] <stashbot>	 T331921: enable de-wp bureaucrats to remove adminflag and to grant importer rights - https://phabricator.wikimedia.org/T331921
[20:36:46] <wikibugs>	 10SRE-swift-storage, 10Commons: 404 error for image thumbnail file on Commons - https://phabricator.wikimedia.org/T332019 (10Yann) Duplicate of T331820
[20:39:29] <wikibugs>	 (03PS3) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080)
[20:40:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney)
[20:41:32] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on cloudcontrol1006 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:41:53] <wikibugs>	 (03PS4) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080)
[20:47:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:47:41] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[20:47:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)...
[20:47:58] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[20:48:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[20:52:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:54:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 227.9k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[21:02:04] <wikibugs>	 (03PS2) 10Raymond Ndibe: replica_cnf: skip toolforge users without a home [puppet] - 10https://gerrit.wikimedia.org/r/898852 (owner: 10David Caro)
[21:02:48] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on cloudcontrol1007 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:09:15] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Don-vip) Another example: https://commons.wikimedia.org/wiki/File:The_Sky_is_Not_the_Limit;_There_Are_Footprints_on_the_Moon_(5932386).jpeg https://upload.wikimedia.org/wikipedia/commons/th...
[21:11:04] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[21:11:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)...
[21:11:22] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[21:11:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[21:11:34] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[21:11:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)...
[21:13:32] <wikibugs>	 (03PS1) 10Umherirrender: action: Restrict action.delete.js to action=delete pages [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/898867 (https://phabricator.wikimedia.org/T330205)
[21:16:57] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[21:17:08] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[21:19:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[21:20:58] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[21:38:00] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[21:38:20] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[22:08:35] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[22:25:27] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[22:34:01] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[22:34:34] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[22:37:28] <icinga-wm>	 RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh1002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[22:37:28] <icinga-wm>	 RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh6002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[22:50:26] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[22:52:16] <icinga-wm>	 RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh5001 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[22:58:48] <brennen>	 jouncebot nowandnext
[22:58:48] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 1 minute(s)
[22:58:48] <jouncebot>	 In 7 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T0600)
[22:59:37] <brennen>	 slinging out https://gerrit.wikimedia.org/r/c/mediawiki/core/+/898867/
[23:19:22] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:898867|action: Restrict action.delete.js to action=delete pages (T330205)]]
[23:19:28] <stashbot>	 T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205
[23:20:55] <logmsgbot>	 !log brennen@deploy2002 brennen and umherirrender: Backport for [[gerrit:898867|action: Restrict action.delete.js to action=delete pages (T330205)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[23:29:55] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:898867|action: Restrict action.delete.js to action=delete pages (T330205)]] (duration: 10m 32s)
[23:30:00] <stashbot>	 T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205