[00:00:05] RoanKattouw and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T0000).
[00:00:05] No Gerrit patches in the queue for this window AFAICS.
[00:11:43] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install labstore100[89] - https://phabricator.wikimedia.org/T299610 (Andrew) Rack and network looks right to me. We might be renaming these hosts but I'll get the task retitled before the servers show up.
[00:27:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_network_flows_internal-sanitization_daily.service,eventlogging_to_druid_network_flows_internal_daily.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:34] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:32:44] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7288 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[00:47:20] PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service,curator_actions_apifeatureusage_eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:14] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 5.671e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:53:40] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the
following (5) node(s) change every puppet run: miscweb1002, labstore1006, labstore1007, build2001, wdqs1010 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[01:00:05] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T0100).
[01:12:06] (CR) Tim Starling: [C: +1] "Should work, approved for self-merge" [mediawiki-config] - https://gerrit.wikimedia.org/r/752212 (owner: Aaron Schulz)
[01:30:15] (PS3) Juan90264: Create Draft namespace for bgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/755413
[01:31:26] (PS4) Juan90264: Create Draft namespace for bgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/755413 (https://phabricator.wikimedia.org/T299224)
[01:36:27] (CR) Cwhite: [C: +1] prometheus: handle non-LVS service::catalog entries [puppet] - https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[01:36:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:39:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:46:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:48:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:53:46] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.067e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:03:46] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7406 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[02:27:36] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7175 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[02:29:56] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11170 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[03:00:31] (CR) Krinkle: Benchmark loading DefaultSettings from YAML (1 comment) [core] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/754911 (owner: Ppchelko)
[03:19:22] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: wdqs1010, miscweb1002, build2001, labstore1006, labstore1007 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[03:32:38] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:00:12] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7377 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[04:07:18] RECOVERY
- Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11118 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[04:15:12] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:52:30] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, labstore1006, wdqs1010, labstore1007, miscweb1002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:09:29] (PS1) Marostegui: Revert "db2129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755418
[06:10:56] (PS1) Marostegui: Revert "mariadb: Disable notifications on a few s6 hosts" [puppet] - https://gerrit.wikimedia.org/r/755419
[06:11:25] (CR) Marostegui: [C: +2] Revert "db2129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755418 (owner: Marostegui)
[06:11:39] (CR) Marostegui: [C: +2] Revert "mariadb: Disable notifications on a few s6 hosts" [puppet] - https://gerrit.wikimedia.org/r/755419 (owner: Marostegui)
[06:14:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:14:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18896 and previous config saved to /var/cache/conftool/dbconfig/20220120-061407-marostegui.json
[06:14:11] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:11] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[06:15:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1180 T299479', diff saved to https://phabricator.wikimedia.org/P18897 and previous config saved to /var/cache/conftool/dbconfig/20220120-061529-marostegui.json
[06:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:33] T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479
[06:15:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18898 and previous config saved to /var/cache/conftool/dbconfig/20220120-061538-marostegui.json
[06:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:43] (PS1) Marostegui: db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755525 (https://phabricator.wikimedia.org/T299479)
[06:17:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1180.eqiad.wmnet with OS bullseye
[06:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:29] (CR) Marostegui: [C: +2] db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755525 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui)
[06:30:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P18899 and previous config saved to /var/cache/conftool/dbconfig/20220120-063042-marostegui.json
[06:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:17] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:45:48] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P18900 and previous config saved to /var/cache/conftool/dbconfig/20220120-064547-marostegui.json
[06:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1180.eqiad.wmnet with OS bullseye
[06:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:27] (PS1) Marostegui: Revert "db1180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755421
[06:51:31] (CR) Marostegui: [C: +2] Revert "db1180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755421 (owner: Marostegui)
[06:54:48] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=get https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[06:55:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18901 and previous config saved to /var/cache/conftool/dbconfig/20220120-065551-root.json
[06:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18902 and previous config saved to /var/cache/conftool/dbconfig/20220120-070052-marostegui.json
[07:00:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:00:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:00:56] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:57] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[07:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:01:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[07:01:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[07:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18903 and previous config saved to /var/cache/conftool/dbconfig/20220120-070119-marostegui.json
[07:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18904 and previous config saved to /var/cache/conftool/dbconfig/20220120-070231-marostegui.json
[07:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:16] RECOVERY - etcd request latencies on kubemaster1001 is OK: All
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:07:16] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:10:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18905 and previous config saved to /var/cache/conftool/dbconfig/20220120-071054-root.json
[07:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:40] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:17:06] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:17:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18906 and previous config saved to /var/cache/conftool/dbconfig/20220120-071736-marostegui.json
[07:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:04] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:22:07] (PS1) Marostegui: mariadb: Move db1128 to m1 [puppet] - https://gerrit.wikimedia.org/r/755526 (https://phabricator.wikimedia.org/T299344)
[07:23:13] (CR) Marostegui: [C: +2] mariadb: Move db1128 to m1 [puppet] - https://gerrit.wikimedia.org/r/755526
(https://phabricator.wikimedia.org/T299344) (owner: Marostegui)
[07:24:58] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:26:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18907 and previous config saved to /var/cache/conftool/dbconfig/20220120-072558-root.json
[07:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:08] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:28:50] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:30:11] SRE, envoy, serviceops: The TLS proxy configuration in deployment-charts allows invalid listeners - https://phabricator.wikimedia.org/T291959 (Joe) a:Joe
[07:30:18] (PS1) Giuseppe Lavagetto: _tls_helpers: fail if a listener is non existent [deployment-charts] - https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959)
[07:31:54] (CR) jerkins-bot: [V: -1] _tls_helpers: fail if a listener is non existent [deployment-charts] - https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959) (owner: Giuseppe Lavagetto)
[07:31:56] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:32:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1128.eqiad.wmnet with OS bullseye
[07:32:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18908 and previous config saved to /var/cache/conftool/dbconfig/20220120-073241-marostegui.json
[07:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:52] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:41:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18909 and previous config saved to /var/cache/conftool/dbconfig/20220120-074105-root.json
[07:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18910 and previous config saved to /var/cache/conftool/dbconfig/20220120-074746-marostegui.json
[07:47:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:47:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:52] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[07:47:54] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T285149)', diff saved to https://phabricator.wikimedia.org/P18911 and previous config saved to /var/cache/conftool/dbconfig/20220120-074753-marostegui.json
[07:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:57] SRE, Traffic, Patch-For-Review, Performance-Team (Radar), User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (MMandere) Open→Resolved a:MMandere We now have varnish upgraded from `6.0.8` to `6.0.9` in all our cache instances (across all datacent...
[07:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T285149)', diff saved to https://phabricator.wikimedia.org/P18912 and previous config saved to /var/cache/conftool/dbconfig/20220120-075005-marostegui.json
[07:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:06] (PS2) Muehlenhoff: Make ganeti1024 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/755440 (https://phabricator.wikimedia.org/T283036)
[07:56:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18913 and previous config saved to /var/cache/conftool/dbconfig/20220120-075609-root.json
[07:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:14] !log Stop mysql on db1117 to clone db1128 T299344
[07:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:17] T299344: Upgrade m1 to Bullseye - https://phabricator.wikimedia.org/T299344
[07:57:40] (CR) Muehlenhoff: [C: +2] Make ganeti1024 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/755440
(https://phabricator.wikimedia.org/T283036) (owner: Muehlenhoff)
[07:59:29] (PS1) Giuseppe Lavagetto: httpbb: remove tests that fail under k8s [puppet] - https://gerrit.wikimedia.org/r/755529 (https://phabricator.wikimedia.org/T285298)
[07:59:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1128.eqiad.wmnet with OS bullseye
[07:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:00] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:02:26] (PS1) Marostegui: install_server: Do not format db1128 [puppet] - https://gerrit.wikimedia.org/r/755530 (https://phabricator.wikimedia.org/T299344)
[08:02:28] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:02:34] haproxy alerts are expected
[08:02:46] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:02:48] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:24] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:34] (CR) Marostegui: [C: +2] install_server: Do not format db1128 [puppet] - https://gerrit.wikimedia.org/r/755530 (https://phabricator.wikimedia.org/T299344) (owner: Marostegui)
[08:03:38] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:58] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:58] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] ACKNOWLEDGEMENT - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] ACKNOWLEDGEMENT - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] ACKNOWLEDGEMENT - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] ACKNOWLEDGEMENT - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] ACKNOWLEDGEMENT - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected
https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] ACKNOWLEDGEMENT - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:14] ACKNOWLEDGEMENT - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:05:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18915 and previous config saved to /var/cache/conftool/dbconfig/20220120-080510-marostegui.json
[08:05:11] (PS1) Majavah: Undeploy UserMerge (1) [mediawiki-config] - https://gerrit.wikimedia.org/r/755532 (https://phabricator.wikimedia.org/T216089)
[08:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:13] (PS1) Majavah: Undeploy UserMerge (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/755533 (https://phabricator.wikimedia.org/T216089)
[08:05:15] (PS1) Majavah: Undeploy UserMerge (3) [mediawiki-config] - https://gerrit.wikimedia.org/r/755534 (https://phabricator.wikimedia.org/T216089)
[08:09:18] SRE, SRE-Access-Requests: Requesting LDAP-only access to analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (Jelto) p:Triage→Medium
[08:10:03] (PS1) Giuseppe Lavagetto: deployment-prep: install php 7.4 everywhere [puppet] - https://gerrit.wikimedia.org/r/755536 (https://phabricator.wikimedia.org/T295578)
[08:11:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18916 and previous config saved to /var/cache/conftool/dbconfig/20220120-081112-root.json
[08:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:49] SRE, SRE-Access-Requests: Requesting LDAP-only access to
analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (Jelto) Thanks for the access request. But there is no group named `analytics-private-data`. I assume you mean `analytics-privatedata-users`, is that correct? If y...
[08:18:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 for on-site maintenance T299123', diff saved to https://phabricator.wikimedia.org/P18917 and previous config saved to /var/cache/conftool/dbconfig/20220120-081809-marostegui.json
[08:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:14] T299123: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123
[08:19:39] (PS1) Marostegui: es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755630 (https://phabricator.wikimedia.org/T299123)
[08:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18918 and previous config saved to /var/cache/conftool/dbconfig/20220120-082015-marostegui.json
[08:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:31] (CR) Marostegui: [C: +2] es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755630 (https://phabricator.wikimedia.org/T299123) (owner: Marostegui)
[08:25:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[08:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18919 and previous config saved to /var/cache/conftool/dbconfig/20220120-082616-root.json
[08:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:03] (PS1) Elukey: knative-serving,kserve-inference: move _helpers.tpl to 0.4 [deployment-charts] -
https://gerrit.wikimedia.org/r/755638 (https://phabricator.wikimedia.org/T292390)
[08:28:05] (CR) Majavah: "{{ping}}" [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[08:29:10] (Abandoned) Elukey: WIP - kserve-inference: add support for local tls proxy [deployment-charts] - https://gerrit.wikimedia.org/r/741092 (owner: Elukey)
[08:33:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[08:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T285149)', diff saved to https://phabricator.wikimedia.org/P18920 and previous config saved to /var/cache/conftool/dbconfig/20220120-083520-marostegui.json
[08:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:25] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[08:35:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:35:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance
[08:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance
[08:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:39] (PS1) Arturo Borrero
Gonzalez: toolforge: automated-toolforge-tests: fix NFS mount point [puppet] - https://gerrit.wikimedia.org/r/755639
[08:35:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:35:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18921 and previous config saved to /var/cache/conftool/dbconfig/20220120-083558-marostegui.json
[08:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:20] (CR) Majavah: "can't we just use /data/project/automated-toolforge-tests for both projects?"
[puppet] - https://gerrit.wikimedia.org/r/755639 (owner: Arturo Borrero Gonzalez)
[08:37:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18922 and previous config saved to /var/cache/conftool/dbconfig/20220120-083711-marostegui.json
[08:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:16] (PS1) Elukey: helmfile.d: remove secrets chart from knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/755640 (https://phabricator.wikimedia.org/T298976)
[08:41:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18923 and previous config saved to /var/cache/conftool/dbconfig/20220120-084120-root.json
[08:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:40] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: automated-toolforge-tests: fix NFS mount point (1 comment) [puppet] - https://gerrit.wikimedia.org/r/755639 (owner: Arturo Borrero Gonzalez)
[08:44:45] (CR) Elukey: [C: +2] helmfile.d: remove secrets chart from knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/755640 (https://phabricator.wikimedia.org/T298976) (owner: Elukey)
[08:45:13] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Legoktm) Using a known broken hash like MD5 seems wrong in what's supposed to be a security-sensitive application. Since we are already calculating the SH...
[08:46:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet [08:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:16] (PS3) Legoktm: P:mw::maintenance: add centralauth group purge job [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: Majavah) [08:47:18] (CR) JMeybohm: [C: +2] Update codfw kubernetes master to a full node [puppet] - https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm) [08:47:22] (PS1) Arturo Borrero Gonzalez: toolforge: automated-tests: drop leftover hash mention [puppet] - https://gerrit.wikimedia.org/r/755642 [08:48:54] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [08:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [08:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:22] (CR) Legoktm: [C: +2] P:mw::maintenance: add centralauth group purge job [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: Majavah) [08:49:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet [08:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:43] (PS2) Elukey: admin_ng: remove the secrets chart from knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/755441 (https://phabricator.wikimedia.org/T298976) [08:51:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:03] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [08:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18924 and previous config saved to /var/cache/conftool/dbconfig/20220120-085215-marostegui.json [08:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:18] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [08:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:59] (CR) Legoktm: "legoktm@mwmaint1002:~$ systemctl status mediawiki_job_purge_expired_global_rights" [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: Majavah) [08:53:13] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:53:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:55:41] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:55:49] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet [08:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:56:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18925 and previous config saved to /var/cache/conftool/dbconfig/20220120-085623-root.json [08:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:38] (CR) Elukey: [C: +2] admin_ng: remove the secrets chart from knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/755441 (https://phabricator.wikimedia.org/T298976) (owner: Elukey) [08:58:01] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet [08:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:09] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:00:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:42] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:26] (KubernetesRsyslogDown) firing: rsyslog on kubestagemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [09:01:27] SRE, Observability-Alerting, User-fgiunchedi: Debug / fine tune puppet failed metrics and alerts on alert* hosts - https://phabricator.wikimedia.org/T299628 (fgiunchedi) [09:03:07] (CR) Elukey: [C: +2] knative-serving,kserve-inference: move _helpers.tpl to 0.4 [deployment-charts] - https://gerrit.wikimedia.org/r/755638 (https://phabricator.wikimedia.org/T292390) (owner: Elukey) [09:05:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:05] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:41] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:52] that's me [09:06:26] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [09:07:10] SRE, Observability-Alerting, User-fgiunchedi: Debug / fine tune puppet failed metrics and alerts on alert* hosts - https://phabricator.wikimedia.org/T299628 (Majavah) I've noticed that when puppet fails to compile catalog, it won't show as failed but will have 0 resources, which is what happened here...
[09:07:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18926 and previous config saved to /var/cache/conftool/dbconfig/20220120-090720-marostegui.json [09:07:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:28] (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: handle non-LVS service::catalog entries [puppet] - https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [09:07:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:51] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [09:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:51] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: automated-tests: drop leftover hash mention [puppet] - https://gerrit.wikimedia.org/r/755642 (owner: Arturo Borrero Gonzalez) [09:11:02] SRE, serviceops, Patch-For-Review, good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (elukey) knative-serving and kserve-inference should be done!
:) [09:11:09] (PS4) Arturo Borrero Gonzalez: wmcs: factorize common arguments [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/754473 [09:11:11] (PS3) Arturo Borrero Gonzalez: wmcs: toolforge: grid: introduce cookbook to repool a node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/754555 (https://phabricator.wikimedia.org/T298948) [09:11:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18927 and previous config saved to /var/cache/conftool/dbconfig/20220120-091127-root.json [09:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:19] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:15:55] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:17:59] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: factorize common arguments [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/754473 (owner: Arturo Borrero Gonzalez) [09:18:05] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: grid: introduce cookbook to repool a node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/754555 (https://phabricator.wikimedia.org/T298948) (owner: Arturo Borrero Gonzalez) [09:18:35] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:22:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18928 and previous config saved to /var/cache/conftool/dbconfig/20220120-092225-marostegui.json [09:22:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [09:22:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [09:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:30] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [09:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T285149)', diff saved to https://phabricator.wikimedia.org/P18929 and previous config saved to /var/cache/conftool/dbconfig/20220120-092232-marostegui.json [09:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:26] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [09:30:58] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:32:13] (PS1) Muehlenhoff: Update to 6.4.5 and enable webauthn [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/755644 [09:33:23] (CR) Jbond: [C: +2] wmflib::deep_merge: add a
deep merge that support arrays [puppet] - https://gerrit.wikimedia.org/r/747525 (owner: Jbond) [09:33:56] (PS1) Filippo Giunchedi: prometheus: skip tcp module, already in http module [puppet] - https://gerrit.wikimedia.org/r/755645 (https://phabricator.wikimedia.org/T291946) [09:34:57] (CR) Filippo Giunchedi: [C: +2] prometheus: skip tcp module, already in http module [puppet] - https://gerrit.wikimedia.org/r/755645 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [09:36:46] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:36:56] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:37:02] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:38:25] (PS1) Jbond: pcc: make positionals optional [puppet] - https://gerrit.wikimedia.org/r/755646 [09:38:44] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33353/console" [puppet] - https://gerrit.wikimedia.org/r/755403 (owner: Jbond) [09:39:54] (CR) Jbond: [V: +1 C: +2] P:docker::reporter: make http_proxy optional [puppet] - https://gerrit.wikimedia.org/r/755403 (owner: Jbond) [09:39:56] (CR) Jbond: [C: +2] pcc: make positionals optional [puppet] - https://gerrit.wikimedia.org/r/755646 (owner: Jbond) [09:42:33] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) (owner: Dzahn) [09:49:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1018.eqiad.wmnet with OS buster [09:49:01] !log jmm@cumin2002 END (FAIL) - Cookbook
sre.hosts.reimage (exit_code=93) for host ganeti1018.eqiad.wmnet with OS buster [09:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1018.eqiad.wmnet with OS buster [09:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:36] (CR) Giuseppe Lavagetto: [C: +1] envoy: Allow configuring delayed_closed_timeout [puppet] - https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [09:50:38] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS buster [09:50:54] (CR) Vgutierrez: [C: +2] envoy: Allow configuring delayed_closed_timeout [puppet] - https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [09:53:39] (CR) Vgutierrez: [C: +2] cache::envoy: Set the delayed_close_timeout to 20s [puppet] - https://gerrit.wikimedia.org/r/755340 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [09:54:16] (CR) Jbond: [C: -1] "should also update the version of Gradle in /gradle/wrapper/gradle-wrapper.properties" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/755644 (owner: Muehlenhoff) [09:55:27] (PS3) Jbond: P:rsyslog: add squid to the list of programs sent to logstash [puppet] - https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) [09:56:12] (PS4) Jbond: P:rsyslog: add squid to the list of programs sent to logstash [puppet] - https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) [09:56:53] !log marostegui@cumin1001
dbctl commit (dc=all): 'Repooling after maintenance db1174 (T285149)', diff saved to https://phabricator.wikimedia.org/P18930 and previous config saved to /var/cache/conftool/dbconfig/20220120-095652-marostegui.json [09:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:57] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [09:57:24] (CR) Jbond: P:rsyslog: add squid to the list of programs sent to logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: Jbond) [09:57:50] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:59:00] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [10:02:14] <_joe_> uhm [10:02:59] <_joe_> just a delete that took more than 100 ms [10:03:08] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:05:28] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:07:25] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [10:11:26] (KubernetesRsyslogDown) resolved: rsyslog on kubestagemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [10:11:45] (CR) Btullis: [C: +1] Deploy the dev version of cassandra to aqs1010.eqiad.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: Btullis) [10:11:57] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:11:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18931 and previous config saved to /var/cache/conftool/dbconfig/20220120-101157-marostegui.json [10:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:37] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:05] (PS4) Arturo Borrero Gonzalez: wmcs: toolforge: introduce cookbook to run tests [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/754944 (https://phabricator.wikimedia.org/T298948) [10:16:07] (PS1) Arturo Borrero Gonzalez: wmcs: __init__: run black -l120 [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/755647 [10:16:09] (PS1) Arturo Borrero Gonzalez: wmcs: refactor cmd-checklist-runner operations [cookbooks] (wmcs) -
https://gerrit.wikimedia.org/r/755648 [10:17:35] (PS1) Jbond: O:pki::multirootca: update config to inject default profile options [puppet] - https://gerrit.wikimedia.org/r/755650 [10:18:03] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:18:21] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:18:31] (CR) jerkins-bot: [V: -1] O:pki::multirootca: update config to inject default profile options [puppet] - https://gerrit.wikimedia.org/r/755650 (owner: Jbond) [10:19:17] (PS1) Elukey: role::pki::root: add the ml_serve intermediate PKI [puppet] - https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) [10:19:27] ah snap bad timing :) [10:20:54] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: __init__: run black -l120 [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/755647 (owner: Arturo Borrero Gonzalez) [10:21:15] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:21:30] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: refactor cmd-checklist-runner operations [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/755648 (owner: Arturo Borrero Gonzalez) [10:21:40] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: introduce cookbook to run tests [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/754944 (https://phabricator.wikimedia.org/T298948) (owner: Arturo Borrero Gonzalez) [10:22:58] (CR) Btullis: [C: +2] Deploy the dev version of cassandra to aqs1010.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: Btullis) [10:26:09] PROBLEM - Disk space on kubestagemaster2001 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/e5cb0fdd-6df9-42f5-8a50-01bff58133e0/volumes/kubernetes.iosecret/calico-node-token-5fmsz is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubestagemaster2001&var-datasource=codfw+prometheus/ops [10:27:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18932 and previous config saved to /var/cache/conftool/dbconfig/20220120-102702-marostegui.json [10:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:47] (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: tests: use sudo [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/755653 [10:32:21] (PS2) Jbond: O:pki::multirootca: update config to inject default profile options [puppet] - https://gerrit.wikimedia.org/r/755650 [10:33:03] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33355/console" [puppet] -
https://gerrit.wikimedia.org/r/755650 (owner: Jbond) [10:33:05] (CR) jerkins-bot: [V: -1] O:pki::multirootca: update config to inject default profile options [puppet] - https://gerrit.wikimedia.org/r/755650 (owner: Jbond) [10:33:45] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: tests: use sudo [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/755653 (owner: Arturo Borrero Gonzalez) [10:34:11] (PS1) Ayounsi: Bump Atlas exporter scrape_timeout from 10 to 30s [puppet] - https://gerrit.wikimedia.org/r/755654 (https://phabricator.wikimedia.org/T251156) [10:35:53] (CR) JMeybohm: [C: +2] Add kubestagemaster2001 to k8s_staging eBGP config [homer/public] - https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm) [10:36:33] (Merged) jenkins-bot: Add kubestagemaster2001 to k8s_staging eBGP config [homer/public] - https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm) [10:38:26] (PS1) Btullis: Stop writing parquet logs to files [puppet] - https://gerrit.wikimedia.org/r/755655 (https://phabricator.wikimedia.org/T297734) [10:41:31] (PS1) Filippo Giunchedi: prometheus: add Host header support to probes [puppet] - https://gerrit.wikimedia.org/r/755656 (https://phabricator.wikimedia.org/T291946) [10:42:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T285149)', diff saved to https://phabricator.wikimedia.org/P18933 and previous config saved to /var/cache/conftool/dbconfig/20220120-104206-marostegui.json [10:42:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:42:11] T285149:
Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:42:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T285149)', diff saved to https://phabricator.wikimedia.org/P18934 and previous config saved to /var/cache/conftool/dbconfig/20220120-104220-marostegui.json [10:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:39] (PS2) Btullis: Stop writing parquet logs to files [puppet] - https://gerrit.wikimedia.org/r/755655 (https://phabricator.wikimedia.org/T297734) [10:42:45] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [10:43:27] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33357/console" [puppet] - https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: Elukey) [10:43:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T285149)', diff saved to https://phabricator.wikimedia.org/P18935 and previous config saved to
/var/cache/conftool/dbconfig/20220120-104332-marostegui.json [10:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:33] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [10:45:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1018.eqiad.wmnet with OS buster [10:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:28] (CR) Ayounsi: [C: +1] Add tls port for cloud vps rabbitmq [homer/public] - https://gerrit.wikimedia.org/r/755478 (https://phabricator.wikimedia.org/T297268) (owner: Majavah) [10:45:30] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS buster completed: - ganeti1018 (**PASS**)...
[10:45:51] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:46:00] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33356/console" [puppet] - https://gerrit.wikimedia.org/r/755656 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [10:47:29] (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: add Host header support to probes [puppet] - https://gerrit.wikimedia.org/r/755656 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [10:49:50] (CR) Ayounsi: [C: +2] Update automatic Icinga LLDP hostgroup [puppet] - https://gerrit.wikimedia.org/r/755342 (owner: Ayounsi) [10:50:11] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:52:20] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [10:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:28] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [10:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:57] (CR) Filippo Giunchedi: [C: +1] Bump Atlas exporter scrape_timeout from 10 to 30s [puppet] - https://gerrit.wikimedia.org/r/755654 (https://phabricator.wikimedia.org/T251156) (owner: Ayounsi) [10:53:27] SRE, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (TheDJ) I've removed graphoid info from https://www.mediawiki.org/wiki/Extension:Graph to avoid further confusion for read... [10:55:51] (CR) Elukey: "Hi folks! Any plan for the deployment?" [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan) [10:55:59] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [10:56:39] (CR) Ayounsi: [C: +2] Bump Atlas exporter scrape_timeout from 10 to 30s [puppet] - https://gerrit.wikimedia.org/r/755654 (https://phabricator.wikimedia.org/T251156) (owner: Ayounsi) [10:58:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18936 and previous config saved to /var/cache/conftool/dbconfig/20220120-105837-marostegui.json [10:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:53] SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff) >>! In T299527#7633551, @Cmjohnson wrote: > I updated the firmware on 1018 Thanks, with the updated firmware I was able to reim... [11:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1100) [11:06:31] (03CR) 10Elukey: [C: 03+1] Stop writing parquet logs to files [puppet] - 10https://gerrit.wikimedia.org/r/755655 (https://phabricator.wikimedia.org/T297734) (owner: 10Btullis) [11:06:54] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/754957 (owner: 10PipelineBot) [11:09:44] (03PS1) 10Filippo Giunchedi: hieradata: probe with http host override [puppet] - 10https://gerrit.wikimedia.org/r/755657 (https://phabricator.wikimedia.org/T291946) [11:09:47] (03PS1) 10Filippo Giunchedi: prometheus: add probes for non-lvs services [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946) [11:10:46] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/754957 (owner: 10PipelineBot) [11:13:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18937 and previous config saved to /var/cache/conftool/dbconfig/20220120-111341-marostegui.json [11:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:54] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging [11:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:56] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production [11:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:27] (03PS2) 10Muehlenhoff: sre.ganeti.addnode: Also check for the analytics bridge in eqiad [cookbooks] - 10https://gerrit.wikimedia.org/r/755442 [11:14:35] (03CR) 10Muehlenhoff: sre.ganeti.addnode: Also check for the analytics bridge in eqiad (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/755442 (owner: 10Muehlenhoff) [11:15:22] (03CR) 
10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/755442 (owner: 10Muehlenhoff) [11:16:39] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:16:51] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: sync on staging [11:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:02] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply on production [11:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:05] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply on staging [11:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:12] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [11:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:21] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [11:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:36] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: sync on production [11:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:22] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33358/console" [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:20:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Return a set, not a list, from active_images() [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748873 (owner: 10RLazarus) [11:21:53] 
!log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [11:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:56] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 03s) [11:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:02] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [11:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:11] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [11:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:08] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply on production [11:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:10] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply on staging [11:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:18] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: sync on production [11:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:38] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/723607 (owner: 10PipelineBot) [11:25:49] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/723658 (owner: 10PipelineBot) [11:28:09] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:28:31] !log jayme@deploy1002 helmfile [staging-codfw] START 
helmfile.d/admin 'apply'. [11:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T285149)', diff saved to https://phabricator.wikimedia.org/P18938 and previous config saved to /var/cache/conftool/dbconfig/20220120-112846-marostegui.json [11:28:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:28:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:50] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [11:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:52] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T285149)', diff saved to https://phabricator.wikimedia.org/P18939 and previous config saved to /var/cache/conftool/dbconfig/20220120-112854-marostegui.json [11:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:56] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. 
[11:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:38] (03PS3) 10Jbond: O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650 [11:30:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T285149)', diff saved to https://phabricator.wikimedia.org/P18940 and previous config saved to /var/cache/conftool/dbconfig/20220120-113006-marostegui.json [11:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33359/console" [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond) [11:30:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [11:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:45] (03CR) 10jerkins-bot: [V: 04-1] O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond) [11:30:48] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[11:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:21] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:33:34] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Also check for the analytics bridge in eqiad [cookbooks] - 10https://gerrit.wikimedia.org/r/755442 (owner: 10Muehlenhoff) [11:35:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [11:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:32] (03PS5) 10Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 [11:38:29] (03CR) 10Btullis: [C: 03+2] Stop writing parquet logs to files [puppet] - 10https://gerrit.wikimedia.org/r/755655 (https://phabricator.wikimedia.org/T297734) (owner: 10Btullis) [11:39:16] (03PS4) 10Jbond: O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650 [11:39:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33360/console" [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond) [11:41:19] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:43:19] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:45:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18941 and 
previous config saved to /var/cache/conftool/dbconfig/20220120-114510-marostegui.json [11:45:13] (03PS3) 10Giuseppe Lavagetto: Rename main cluster to wikikube (1/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [11:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond) [11:49:51] !log add ganeti1024 to Ganeti eqiad cluster T283036 [11:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:54] T283036: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 [11:49:55] (03PS3) 10Jbond: Do NOT MERGE "role::pki::multirootca: add expiry for k8s_mlserve" [puppet] - 10https://gerrit.wikimedia.org/r/755408 [11:51:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33361/console" [puppet] - 10https://gerrit.wikimedia.org/r/755408 (owner: 10Jbond) [11:54:12] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:55:46] I deployed an update that I think broke metrics: https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=30s&from=now-30m&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=citoid [11:55:56] was supposed to be backwards compatible [11:56:06] do I revert for the time being? [11:57:00] jelto: what do you think? [11:57:54] (03PS1) 10Mvolz: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/755666 [12:00:04] Amir1, Lucas_WMDE, and apergos: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1200). [12:00:04] noa_wmde: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18942 and previous config saved to /var/cache/conftool/dbconfig/20220120-120015-marostegui.json [12:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:29] o/ [12:00:37] noa is having IRC troubles but will hopefully join soon [12:00:41] (and I can deploy) [12:01:03] mvolz: for now I’m not deploying yet and you’re good to go if you need to roll something back [12:01:30] Lucas_WMDE: I think I will, thanks :) [12:01:47] (03CR) 10Mvolz: [C: 03+2] Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/755666 (owner: 10Mvolz) [12:04:11] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging [12:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:14] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production [12:04:15] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on staging [12:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:18] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:04:24] I *did* think it might break metrics and even checked after but it took longer than I thought to show up and then moved on. 🙄 sorry for overlapping [12:05:18] (looks like nobody signed up for training today btw) [12:05:28] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging [12:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:30] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production [12:05:31] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on staging [12:05:31] (03Merged) 10jenkins-bot: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/755666 (owner: 10Mvolz) [12:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:55] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging [12:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:57] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production [12:05:58] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on staging [12:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:06] excuse me the previous meeting ran over [12:06:20] there is one patch for the window that is a config patch, and no trainees scheduled [12:06:26] the one patch looked straightforward to me [12:06:36] yup, I’ll deploy it once noa joins [12:06:40] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging [12:06:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:42] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production [12:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:11] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: sync on staging [12:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:16] (the previous meeting is actually still going, I am trying to participate in a complicated db config discussion while being here, heh) [12:08:08] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply on production [12:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:11] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply on staging [12:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:01] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: sync on production [12:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:55] hey Noa_WMDE [12:10:00] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply on production [12:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:02] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply on staging [12:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:05] are you here for your config patch? [12:10:30] Hi apergos, yes [12:10:43] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: sync on production [12:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:23] mvolz: was that the last sync?
[12:11:24] I am hopefully done now [12:11:28] ok [12:11:30] You're the only one in the window, I believe Lucas_WMDE is doing actual deploys, if you don't have the rights [12:11:59] yep, that's the plan. thanks! [12:12:30] hm, my `logspam-watch` is being slow to start it seems [12:12:36] * Lucas_WMDE tries in another SSH connection [12:13:58] ok now it loaded [12:14:13] (03PS2) 10Lucas Werkmeister (WMDE): Enable usage tracking for statements in Waray Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755322 (https://phabricator.wikimedia.org/T296383) (owner: 10Noa wmde) [12:14:14] \o/ [12:15:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T285149)', diff saved to https://phabricator.wikimedia.org/P18943 and previous config saved to /var/cache/conftool/dbconfig/20220120-121520-marostegui.json [12:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:24] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [12:15:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable usage tracking for statements in Waray Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755322 (https://phabricator.wikimedia.org/T296383) (owner: 10Noa wmde) [12:16:24] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:16:54] (03Merged) 10jenkins-bot: Enable usage tracking for statements in Waray Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755322 (https://phabricator.wikimedia.org/T296383) (owner: 10Noa wmde) [12:17:31] Noa_WMDE: alright, the change is on mwdebug1001 now [12:17:33] do you know how to test it? [12:18:12] not more than keeping an eye on the dashboard no [12:18:28] have you used the WikimediaDebug extension before? 
[12:18:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:51] Is that the one where you can change servers? [12:18:54] yeah [12:19:10] I think so but I need to find out if it's installed [12:19:18] and I think it should be possible to test this change by purging a warwiki page on mwdebug1001 and then looking at action=info to see which entity usage it now has [12:19:23] we just need to find a page that uses Wikidata statements [12:19:59] okay it's installed [12:20:48] can I add a page and purge it directly? [12:21:06] well, it needs to be a page that uses Wikidata [12:21:38] looks like https://war.wikipedia.org/wiki/Sangkalibutan has an “other” usage on Q1 [12:21:44] so we can try that one [12:22:05] add ?action=purge with the extension enabled and set to mwdebug1001 [12:22:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:22:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:54] I completely purge (it's on mwdebug1001) [12:24:13] (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: drop k8s-specific default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755692 [12:24:13] hm, so far https://war.wikipedia.org/w/index.php?title=Sangkalibutan&action=info still looks like an “other” usage [12:24:13] okay, cache purged. 
[12:24:15] (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: refresh header comment [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755693 [12:24:17] (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: migrate create_instance_with_prefix to CommonOpts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755694 [12:24:47] let’s try https://war.wikipedia.org/wiki/Orl%C3%A9ans ? (one of few pages linking to a Module:Wd, apparently) [12:25:14] ok [12:26:24] purged [12:26:36] where in the info can you see the usage type? [12:26:45] under “Wikidata entities used in this page” [12:26:55] and so far it still looks like “other” usage :/ [12:27:58] let’s use a sandbox page so we know it uses statements https://war.wikipedia.org/wiki/Gumaramit:Lucas_Werkmeister_(WMDE)/sandbox [12:28:18] yay, there’s a statement usage in https://war.wikipedia.org/w/index.php?title=Gumaramit:Lucas_Werkmeister_(WMDE)/sandbox&action=info [12:28:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:16] I think that’s good enough to deploy [12:29:17] purged [12:29:28] yeah I saw a statement [12:30:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:30:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:05] I guess it's just a very specific case to find live examples for [12:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:18] yeah [12:30:24] syncing [12:30:53] there are three deprecations at the top of logspam-watch btw, one of them with 36k occurrences in the past hour [12:30:58] I assume someone™ is taking care of those [12:31:14] !log lucaswerkmeister-wmde@deploy1002 Synchronized
wmf-config/InitialiseSettings.php: Config: [[gerrit:755322|Enable usage tracking for statements in Waray Wikipedia (T296383)]] (expecting some gradual increase of wbc_entity_usage rows on warwiki) (duration: 00m 51s) [12:31:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:17] T296383: Enable statement usage tracking on warwiki - https://phabricator.wikimedia.org/T296383 [12:31:18] and not just adding hard deprecations to prod and then leaving them there [12:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:05] purging my sandbox without mwdebug now [12:32:14] entity usage still has statements, yay [12:33:32] \o/ [12:35:48] (03PS1) 10Lucas Werkmeister (WMDE): Replace remaining usages of IDatabase::fetchObject() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755667 (https://phabricator.wikimedia.org/T299471) [12:35:58] ^ let’s just backport this now [12:36:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Replace remaining usages of IDatabase::fetchObject() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755667 (https://phabricator.wikimedia.org/T299471) (owner: 10Lucas Werkmeister (WMDE)) [12:39:14] (03PS2) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: refresh comments [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755693 [12:39:16] (03PS2) 10Arturo Borrero Gonzalez: wmcs: vps: migrate create_instance_with_prefix to CommonOpts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755694 [12:39:49] (03PS1) 10Lucas Werkmeister (WMDE): Fix deprecation warning from LinksUpdate::getImages() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755668 (https://phabricator.wikimedia.org/T299472) [12:40:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix deprecation warning from LinksUpdate::getImages() [core] 
(wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755668 (https://phabricator.wikimedia.org/T299472) (owner: 10Lucas Werkmeister (WMDE)) [12:40:35] ^and this one too [12:40:40] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: probe with http host override [puppet] - 10https://gerrit.wikimedia.org/r/755657 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:40:51] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: add probes for non-lvs services [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:41:02] (03PS2) 10Filippo Giunchedi: prometheus: add probes for non-lvs services [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946) [12:41:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: vps: create_instance_with_prefix: drop k8s-specific default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755692 (owner: 10Arturo Borrero Gonzalez) [12:46:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: vps: create_instance_with_prefix: refresh comments [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755693 (owner: 10Arturo Borrero Gonzalez) [12:46:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: vps: migrate create_instance_with_prefix to CommonOpts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755694 (owner: 10Arturo Borrero Gonzalez) [12:50:29] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops, 10Documentation: Documentation updates in decom workflow - https://phabricator.wikimedia.org/T287388 (10Aklapper) [12:51:39] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 9 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Aklapper) Half a year later, does someone plan to pick up https://gerrit.wikimedia.org/r/c/710138 , or what is left to do in this open high prio t... 
[12:57:41] (03Merged) 10jenkins-bot: Replace remaining usages of IDatabase::fetchObject() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755667 (https://phabricator.wikimedia.org/T299471) (owner: 10Lucas Werkmeister (WMDE)) [12:57:45] yay [12:59:09] (03Merged) 10jenkins-bot: Fix deprecation warning from LinksUpdate::getImages() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755668 (https://phabricator.wikimedia.org/T299472) (owner: 10Lucas Werkmeister (WMDE)) [13:00:00] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.18/includes/: Backport: [[gerrit:755667|Replace remaining usages of IDatabase::fetchObject() (T299471)]] (1/2) (duration: 00m 56s) [13:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] T299471: PHP Deprecated: Use of Wikimedia\Rdbms\DBConnRef::fetchObject was deprecated in MediaWiki 1.37. [Called from SpecialRandomPage::selectRandomPageFromDB] - https://phabricator.wikimedia.org/T299471 [13:01:07] I’ll slightly overrun the window to finish these backports [13:01:09] jouncebot: now [13:01:09] No deployments scheduled for the next 3 hour(s) and 58 minute(s) [13:01:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.18/maintenance/: Backport: [[gerrit:755667|Replace remaining usages of IDatabase::fetchObject() (T299471)]] (2/2) (duration: 00m 50s) [13:01:13] nothing else going on at least [13:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:02:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:02:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:57] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.18/includes/deferred/LinksUpdate/LinksUpdate.php: Backport: [[gerrit:755668|Fix deprecation warning from LinksUpdate::getImages() (T299472)]] (duration: 00m 50s) [13:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:01] T299472: PHP Deprecated: Use of MediaWiki\Deferred\LinksUpdate\LinksUpdate::$mImages was deprecated in MediaWiki 1.38. [Called from MediaWiki\Extension\GlobalUsage\Hooks::onLinksUpdateComplete] - https://phabricator.wikimedia.org/T299472 [13:03:06] !log UTC morning backport window done [13:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:22] (I’ll watch the error log for a few more minutes, the deprecation volume should go down dramatically) [13:03:41] (03PS1) 10JMeybohm: Fix nrpe_check_disk_options hiera key for kubernetes staging masters [puppet] - 10https://gerrit.wikimedia.org/r/755698 (https://phabricator.wikimedia.org/T290967) [13:03:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:21] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, 10Patch-For-Review: add traceroute measurements to RIPE Atlas prometheus data - https://phabricator.wikimedia.org/T251156 (10ayounsi) 05Open→03Resolved a:05CDanis→03ayounsi This is done, opened T299640 for further improvements. 
[13:04:25] 10SRE, 10observability: Add RIPE atlas data to Prometheus - https://phabricator.wikimedia.org/T167689 (10ayounsi) [13:04:41] (03PS1) 10Filippo Giunchedi: prometheus: match SNI with Host when overridden [puppet] - 10https://gerrit.wikimedia.org/r/755699 (https://phabricator.wikimedia.org/T291946) [13:06:15] https://i.imgur.com/b9iPByi.png [13:06:35] (03CR) 10JMeybohm: [C: 03+2] Fix nrpe_check_disk_options hiera key for kubernetes staging masters [puppet] - 10https://gerrit.wikimedia.org/r/755698 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:06:35] much better [13:07:17] (03PS1) 10Elukey: Remove duplicate hiera config for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/755702 [13:07:43] (03PS1) 10Lucas Werkmeister (WMDE): Replace remaining usages of IDatabase::fetchObject()/::numRows() [extensions/CentralNotice] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755670 (https://phabricator.wikimedia.org/T286694) [13:07:48] I’ll just cherry pick the last big one too [13:08:11] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Replace remaining usages of IDatabase::fetchObject()/::numRows() [extensions/CentralNotice] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755670 (https://phabricator.wikimedia.org/T286694) (owner: 10Lucas Werkmeister (WMDE)) [13:08:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:56] RECOVERY - Disk space on kubestagemaster2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubestagemaster2001&var-datasource=codfw+prometheus/ops [13:09:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:09:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn 
[13:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:13] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33362/console" [puppet] - 10https://gerrit.wikimedia.org/r/755699 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:11:17] (03Merged) 10jenkins-bot: Replace remaining usages of IDatabase::fetchObject()/::numRows() [extensions/CentralNotice] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755670 (https://phabricator.wikimedia.org/T286694) (owner: 10Lucas Werkmeister (WMDE)) [13:11:26] (03PS2) 10Muehlenhoff: Update to 6.4.5 and enable webauthn [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644 [13:11:31] (03CR) 10Muehlenhoff: Update to 6.4.5 and enable webauthn (033 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644 (owner: 10Muehlenhoff) [13:13:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/CentralNotice/includes/: Backport: [[gerrit:755670|Replace remaining usages of IDatabase::fetchObject()/::numRows() (T286694)]] (duration: 00m 50s) [13:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:11] T286694: Drop legacy cruft arising from introduction of ResultWrapper - https://phabricator.wikimedia.org/T286694 [13:13:13] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: match SNI with Host when overridden [puppet] - 10https://gerrit.wikimedia.org/r/755699 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:14:11] 10SRE, 10Platform Engineering, 
10Wikimedia-Mailing-lists: Close / shut down public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Aklapper) [13:15:26] 10SRE, 10Kubernetes, 10discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (10Aklapper) @Joe: Do you know, by any chance? (Or have some link handy?) [13:15:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti1025.eqiad.wmnet with reason: Change KVM setting in BIOS [13:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti1025.eqiad.wmnet with reason: Change KVM setting in BIOS [13:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [13:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:08] even better now https://i.imgur.com/wwHOA7K.png [13:17:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:17:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:52] not backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Renameuser/+/755507 since that looks like it would be a very rare warning [13:18:04] and also the change doesn’t look as trivial as the others [13:18:09] so I’ll just let that roll out with the train [13:18:24] pretty sure I’m 
actually done now [13:18:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:53] the problem with the cn one is that it is going to reappear next week unless it gets merged into the wmf_deploy branch [13:19:33] I don’t follow [13:19:37] what wmf_deploy branch? [13:21:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [13:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:32] it’s cherry-picked on wmf.18, and merged on master before the wmf.19 branch cut, shouldn’t that be enough? [13:21:58] o_O https://wikitech.wikimedia.org/wiki/CentralNotice#Deployment [13:22:22] CentralNotice has a 'special' practice that they have a wmf_deploy branch. The wmf branches are cut from that branch. So everything needs to be cherry-picked from master to that branch first... [13:22:29] O_o [13:22:50] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:24:01] so, are we supposed to cherry-pick the patch to wmf_deploy? [13:24:04] or merge master into wmf_deploy? [13:24:08] or is someone else responsible for that? [13:25:37] well, I uploaded https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/755674 [13:26:05] I think that's fundraising tech area? tbh I don't know what we are supposed to do. 
[13:27:02] CCed the person who uploaded most of the other recent wmf_deploy changes on Gerrit 🤷 [13:27:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [13:27:30] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:30:18] (03PS1) 10DCausse: ejoseph: update ssh key [puppet] - 10https://gerrit.wikimedia.org/r/755706 [13:36:20] (03CR) 10DCausse: "@EJoseph can you confirm that this is the SSH key you'll be using for production access?" [puppet] - 10https://gerrit.wikimedia.org/r/755706 (owner: 10DCausse) [13:37:06] PROBLEM - Check whether ferm is active by checking the default input chain on db1100 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:37:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [13:37:24] PROBLEM - Check systemd state on db1100 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:25] Amir1: ^ that's the host that got rebooted yesterday? [13:40:40] yes [13:40:44] let me depool it [13:40:45] (03CR) 10EJoseph: ejoseph: update ssh key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755706 (owner: 10DCausse) [13:40:46] Amir1: so puppet is disabled there [13:41:00] 10SRE, 10SRE-Access-Requests: Requesting LDAP-only access to analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10Ottomata) Approved. [13:41:02] can I enable it? 
[13:41:03] it' [13:41:10] let me first depool it [13:41:14] no need [13:42:00] oh okay [13:42:04] let me abort my change [13:42:23] so the script didn't enable puppet at the end? [13:42:36] the recovery should arrive soon, just ran puppet [13:43:03] (03CR) 10Gehel: [C: 03+2] "I confirmed the key with Emmanuel" [puppet] - 10https://gerrit.wikimedia.org/r/755706 (owner: 10DCausse) [13:43:04] I did that manually because it just didn't get back up [13:43:19] but I forgot to reenable puppet [13:43:28] ah cool [13:43:30] no problem [13:43:45] the cookbook doesn't have "let's pick it from here" AFAIK :( [13:45:08] RECOVERY - Check whether ferm is active by checking the default input chain on db1100 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:45:08] RECOVERY - Check systemd state on db1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:34] (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: support creating more than 1 instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755707 [13:50:51] (03CR) 10Nskaggs: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/755489 (https://phabricator.wikimedia.org/T297683) (owner: 10Andrew Bogott) [13:51:00] !log enabled hardware virtualisation in BIOS for ganeti1025 T293909 [13:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:05] T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 [13:52:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti1024.eqiad.wmnet with reason: Change hw virt setting in BIOS [13:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti1024.eqiad.wmnet with reason: Change hw virt setting in BIOS [13:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [13:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:26] (03CR) 10Jbond: role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [13:55:47] !log Power off es1022 for onsite maintenance T299123 [13:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:51] T299123: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 [13:56:53] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10MoritzMuehlenhoff) [13:57:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10MoritzMuehlenhoff) [13:58:27] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - 
https://phabricator.wikimedia.org/T299123 (10Marostegui) @Cmjohnson is now off. You can proceed as needed. [14:00:13] (03PS4) 10Jbond: P:pki::multirootca: Only override differences [puppet] - 10https://gerrit.wikimedia.org/r/755408 [14:00:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33363/console" [puppet] - 10https://gerrit.wikimedia.org/r/755408 (owner: 10Jbond) [14:03:36] !log enabled hardware virtualisation in BIOS for ganeti1024 T283036 [14:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:41] T283036: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 [14:05:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [14:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:02] (03CR) 10Volans: [C: 03+1] "LGTM for now, let's revisit once we have the group in netbox." [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 (owner: 10Muehlenhoff) [14:06:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [14:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:20] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10Volans) Although not optimal, in the worst case scenario in which we will be unable to find/modify a tool to preserve empty lines, we could also co... [14:10:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10WDoranWMF) @Dzahn Is it possible to add @MNadrofsky to the approver lists as he is the Platform Tech Director? 
[14:13:05] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 (owner: 10Muehlenhoff) [14:13:09] (03CR) 10Hnowlan: admin: add Desiree Abad as approver for platform-engineering groups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) (owner: 10Dzahn) [14:16:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [14:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:09] elukey: cumin [14:17:15] (03PS1) 10Filippo Giunchedi: Add prometheus[12]00[56] to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/755708 (https://phabricator.wikimedia.org/T296199) [14:20:53] !log enabled hardware virtualisation in BIOS for ganeti1023 T283036 [14:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:57] T283036: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 [14:21:24] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) >>! In T292322#7635623, @Legoktm wrote: > Using a known broken hash like MD5 seems wrong in what's supposed to be a security-sensitive application. S... 
[14:21:28] (03CR) 10Ottomata: [C: 03+2] analytics:refinery:job:data_purge: Add deletion for anomaly detection [puppet] - 10https://gerrit.wikimedia.org/r/753052 (https://phabricator.wikimedia.org/T298972) (owner: 10Mforns) [14:25:09] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster [14:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:48] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster [14:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:20] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster [14:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:25] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644 (owner: 10Muehlenhoff) [14:35:16] marostegui: cumin cumin [14:35:45] \o/ [14:36:38] (03CR) 10Jbond: [V: 03+1 C: 03+2] "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/755408 (owner: 10Jbond) [14:43:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [14:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:53] (03PS1) 10Filippo Giunchedi: hieradata: add host-specific Prometheus data [puppet] - 10https://gerrit.wikimedia.org/r/755711 (https://phabricator.wikimedia.org/T296199) [14:52:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [14:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:40] 10SRE, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) 23:35 mutante: puppetmaster1001 - revoked puppet cert miscweb.discovery.wmnet; updated 
kube_services.crts.yaml to include static-bugzilla.wikimedia.org, removed miscweb.... [14:54:30] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update to 6.4.5 and enable webauthn [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644 (owner: 10Muehlenhoff) [14:55:40] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [14:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:52] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 11s) [14:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) @Muehlenhoff Could you comment on that? Should the structure be that it has team EMs rather than directors? And if there are multiple approv... 
[14:56:34] !log enabled hardware virtualisation in BIOS for ganeti1026 T293909 [14:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:38] T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 [14:57:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [14:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:54] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster [14:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:01] (03PS1) 10Filippo Giunchedi: thanos: move to a single flag to control uploads [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) [14:58:15] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster [14:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10MoritzMuehlenhoff) >>! In T298124#7636613, @Dzahn wrote: > @Muehlenhoff Could you comment on that? Should the structure be that it has team EMs rat... 
[15:02:32] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33364/console" [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [15:04:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [15:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:23] !log enabled hardware virtualisation in BIOS for ganeti1027 T293909 [15:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:26] T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 [15:05:31] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster [15:05:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [15:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:40] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster [15:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:02] (03PS2) 10Filippo Giunchedi: thanos: move to a single flag to control uploads [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) [15:08:31] (03PS1) 10Hashar: ci: set Docker partition size explicitly [puppet] - 10https://gerrit.wikimedia.org/r/755713 [15:11:36] PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:11:43] !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply on main [15:11:43] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33365/console" [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:44] !log enabled hardware virtualisation in BIOS for ganeti1028 T293909 [15:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:47] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33366/console" [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [15:12:48] T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 [15:13:57] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster [15:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:24] PROBLEM - Unmerged changes on repository puppet on puppetmaster2003 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:14:35] !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: sync on main [15:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [15:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:59] (03CR) 10Filippo Giunchedi: [V: 03+1] "This change will enable uploads for the 'ext' instance, which I think is fine" [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [15:16:01] !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply on main [15:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:52] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:17:12] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:17:39] ^ would not touch that one, it looks category: risk-very-high (multi root CA change :) [15:20:13] !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: sync on main [15:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:05] (03PS2) 10Hashar: ci: set Docker partition size explicitly [puppet] - 10https://gerrit.wikimedia.org/r/755713 [15:22:23] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster [15:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:42] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (10Joe) [15:23:08] sorry merged changes now [15:23:34] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:23:41] (03PS1) 10Muehlenhoff: Fix test for virtualisation [cookbooks] - 10https://gerrit.wikimedia.org/r/755714 [15:23:56] (03CR) 10Dzahn: "looks good, just nitpick that the link in the commit message to show where it was changed links back to itself" [puppet] - 10https://gerrit.wikimedia.org/r/755329 (owner: 10Hashar) [15:24:00] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:24:13] RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:25:18] RECOVERY - Unmerged changes on repository puppet on puppetmaster2003 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:25:19] (03CR) 10Dzahn: [C: 03+2] "confirmed in upstream docs default is false" [puppet] - 10https://gerrit.wikimedia.org/r/755328 (owner: 10Hashar) [15:27:27] jbond: no problem at all, it seemed obvious that type of change might need some extra care at merge [15:28:56] (03Abandoned) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [15:31:28] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster [15:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:37] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster [15:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:14] (03PS1) 10Elukey: Set inference.discovery.wmnet to production stage [puppet] - 10https://gerrit.wikimedia.org/r/755715 (https://phabricator.wikimedia.org/T289835) [15:37:53] (03CR) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [15:42:21] 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [15:42:45] (03CR) 10Elukey: api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [15:43:30] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster [15:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:26] (03CR) 10Filippo Giunchedi: 
[C: 03+1] remove elk5 related LVS services [puppet] - 10https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [15:45:48] (03CR) 10Elukey: [V: 03+1] role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [15:46:02] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:46:51] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:00] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [15:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:02] 10SRE, 10SRE-Access-Requests: Requesting LDAP-only access to analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10Jelto) [15:57:34] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2018.codfw.wmnet with OS buster [15:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:37] (03PS2) 10Elukey: Set inference.discovery.wmnet to production stage [puppet] - 10https://gerrit.wikimedia.org/r/755715 (https://phabricator.wikimedia.org/T289835) [16:00:51] (03CR) 10Elukey: [C: 03+2] Set inference.discovery.wmnet to production stage [puppet] - 10https://gerrit.wikimedia.org/r/755715 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [16:01:05] "(Cannot access the database: Cannot access the database: Unknown database 'metawiki' (db1169) (db1169)" [16:01:08] ??? 
[16:01:36] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [16:03:44] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [16:08:49] (03PS1) 10Muehlenhoff: Make ganeti1025 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755723 [16:11:22] (03CR) 10Jbond: role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [16:11:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10WDoranWMF) On that basis it makes most sense to add me and Atieno(Atieno is a new EM on Platform she is setting up her phab/gerrit at the moment).... [16:12:50] PROBLEM - Host ganeti1018 is DOWN: PING CRITICAL - Packet loss = 100% [16:13:53] Bsadowski1: not sure if you have already been told, but someone is already looking at that, seems like a recent issue [16:15:17] (03CR) 10Elukey: [V: 03+1] role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [16:16:00] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) yes, mgmt works via ssh but the new version doesn't allow me to access the web interface. I use that interface to do most firmware update... 
[16:16:10] (03CR) 10Jbond: role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [16:19:59] RECOVERY - Host ganeti1018 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:20:23] (03PS2) 10Elukey: Add dns discovery settings for inference.svc.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/730541 (https://phabricator.wikimedia.org/T289835) [16:22:41] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Cmjohnson) @Marostegui BIOS and network Firmware updated, this should fix your issue. I will leave task open until you confirm all is well. [16:23:09] (03PS3) 10Elukey: Add dns discovery settings for inference.svc.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/730541 (https://phabricator.wikimedia.org/T289835) [16:23:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) @Muehlenhoff Thank you! makes sense @WDoranWMF Ok, and yes, actually that would be ideal if you make a new request, thank you. Since it's n... [16:25:04] Lucas_WMDE: hi! I just replied on the Gerrit change... thanks for working on that.. is deployment urgent at all, or can it wait until next week's train? 
[16:25:16] (^ wrt CentralNotice deploy stuff) [16:26:07] AndyRussG: I just replied :) [16:26:10] nothing urgent I think [16:26:11] (03PS1) 10Ladsgroup: DatabaseBlock: Pass database name to getConnectionRef [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755676 [16:26:21] (03CR) 10Ladsgroup: [C: 03+2] DatabaseBlock: Pass database name to getConnectionRef [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755676 (owner: 10Ladsgroup) [16:26:27] PROBLEM - Host ganeti1018 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:01] jouncebot: nowandnext [16:27:01] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [16:27:01] In 0 hour(s) and 32 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1700) [16:27:58] Lucas_WMDE: ah fantastic thanks so much!! :) [16:28:05] feel free to abandon the patch if it’s not needed :) [16:28:41] Lucas_WMDE: ok thanks.. yeah I'll do a general merge of master to wmf_deploy before the next branch cut, then, apologies again for the twisted process ;p [16:28:46] alright :) [16:28:53] PROBLEM - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is CRITICAL: TCP CRITICAL - Invalid hostname, address or socket: inference.discovery.wmnet https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:30:01] (03PS2) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: support creating more than 1 instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755707 [16:30:06] (03CR) 10Volans: wmcs: move grid-dedicated code to its own package (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753769 (owner: 10Arturo Borrero Gonzalez) [16:31:19] (03PS3) 10Hashar: ci: set Docker partition size explicitly [puppet] - 10https://gerrit.wikimedia.org/r/755713 [16:31:29] PROBLEM - LVS inference eqiad port 30443/tcp - Inference ML service IPv4 on inference.svc.eqiad.wmnet is CRITICAL: TCP CRITICAL 
- Invalid hostname, address or socket: inference.discovery.wmnet https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:35:11] this is me --^ [16:35:18] ACK [16:35:32] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2018.codfw.wmnet with OS buster [16:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2018.codfw.wmnet [16:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:19] RECOVERY - Host ganeti1018 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [16:37:48] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/755714 (owner: 10Muehlenhoff) [16:38:05] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Fix test for virtualisation [cookbooks] - 10https://gerrit.wikimedia.org/r/755714 (owner: 10Muehlenhoff) [16:38:20] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) [16:38:28] jouncebot: nowandnext [16:38:28] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [16:38:28] In 0 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1700) [16:39:09] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) @MoritzMuehlenhoff The idrac is giving me a hard time, it's not worth slowing this process down. The idrac has no bearing on your issue.... 
[16:39:11] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:08] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2019.codfw.wmnet with OS buster [16:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:11] ACKNOWLEDGEMENT - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is CRITICAL: TCP CRITICAL - Invalid hostname, address or socket: inference.discovery.wmnet daniel_zahn known, will be fixed soon https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:41:11] ACKNOWLEDGEMENT - LVS inference eqiad port 30443/tcp - Inference ML service IPv4 on inference.svc.eqiad.wmnet is CRITICAL: TCP CRITICAL - Invalid hostname, address or socket: inference.discovery.wmnet daniel_zahn known, will be fixed soon https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:43:27] I'll deploy a little mw-config change if nobody minds [16:43:33] (03CR) 10Ppchelko: [C: 03+2] Add temporary entrypoint for settings benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755399 (owner: 10Ppchelko) [16:43:58] (03CR) 10jerkins-bot: [V: 04-1] DatabaseBlock: Pass database name to getConnectionRef [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755676 (owner: 10Ladsgroup) [16:44:23] (03Merged) 10jenkins-bot: Add temporary entrypoint for settings benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755399 (owner: 10Ppchelko) [16:45:31] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:49] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: 
/{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [16:46:59] (03PS1) 10Ladsgroup: Revert "Make Block objects aware of which wiki they belong to" [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755678 [16:47:27] (03Abandoned) 10Ladsgroup: DatabaseBlock: Pass database name to getConnectionRef [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755676 (owner: 10Ladsgroup) [16:47:35] (03CR) 10Ladsgroup: [C: 03+2] Revert "Make Block objects aware of which wiki they belong to" [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755678 (owner: 10Ladsgroup) [16:47:45] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:47:59] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [16:48:01] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2019.codfw.wmnet with OS buster [16:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:25] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2019.codfw.wmnet with OS buster [16:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) Also, let's incorporate @hnowlan's comments on https://gerrit.wikimedia.org/r/c/operations/puppet/+/755500/1/modules/admin/data/data.yaml in... 
[16:50:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:46] !log ppchelko@deploy1002 Synchronized w/tmp_settings_bench.php: Config: gerrit 755399 add temporary entrypoint for settings benchmark (duration: 00m 50s) [16:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:51:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:53] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [16:52:01] (03CR) 10Dzahn: [C: 03+1] admin: add Desiree Abad as approver for platform-engineering groups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) (owner: 10Dzahn) [16:52:39] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Cmjohnson) a:05Cmjohnson→03Marostegui [16:52:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:41] (03Abandoned) 10Dzahn: admin: add Desiree Abad as approver for platform-engineering groups [puppet] - 10https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) (owner: 10Dzahn) [16:55:13] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2019.codfw.wmnet with OS buster [16:55:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:50] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2020.codfw.wmnet with OS buster [16:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) 05Open→03Resolved per above (we can link the new ticket here once it's created) [17:00:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:30] ✅ [17:00:34] (03CR) 10Elukey: [C: 03+2] Add dns discovery settings for inference.svc.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/730541 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [17:01:19] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install labstore100[89] - https://phabricator.wikimedia.org/T299610 (10Andrew) clouddumps100x or clouddatasets100x or just datasets100x [17:01:55] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [17:01:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Prod-Kubernetes, and 3 others: decommission kubestage100[12]-eqiad - https://phabricator.wikimedia.org/T299142 (10Cmjohnson) [17:03:09] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2020.codfw.wmnet with OS buster [17:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:40] !log 
elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=inference [17:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Prod-Kubernetes, and 3 others: decommission kubestage100[12]-eqiad - https://phabricator.wikimedia.org/T299142 (10Cmjohnson) 05Open→03Resolved [17:05:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:51] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2021.codfw.wmnet with OS buster [17:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:24] (03PS1) 10Dzahn: Revert "Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb"" [puppet] - 10https://gerrit.wikimedia.org/r/755681 [17:08:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1008.eqiad.wmnet with OS buster [17:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster [17:09:03] (03Merged) 10jenkins-bot: Revert "Make Block objects aware of which wiki they belong to" [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755678 (owner: 10Ladsgroup) [17:13:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:14:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply on pinkunicorn [17:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:03] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1008.eqiad.wmnet with OS buster [17:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster executed with errors: -... [17:15:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1008.eqiad.wmnet with OS buster [17:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster [17:15:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:04] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.18/includes/: Backport: [[gerrit:755678|Revert "Make Block objects aware of which wiki they belong to"]] (duration: 00m 55s) [17:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:30] we might get a flood of errors now [17:18:36] but it should recover [17:21:49] (03CR) 10JMeybohm: [C: 04-1] "The diff this creates does not look right, but I've no idea why. 
Have not looked in detail, though" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [17:24:34] (03PS16) 10Brennen Bearnes: gitlab-runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [17:27:31] (03CR) 10Andrew Bogott: [C: 03+2] toolforge grid engine: install fdm [puppet] - 10https://gerrit.wikimedia.org/r/755489 (https://phabricator.wikimedia.org/T297683) (owner: 10Andrew Bogott) [17:27:57] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [17:28:05] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1008.eqiad.wmnet with OS buster [17:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster executed with errors: -... [17:28:22] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [17:34:09] RECOVERY - LVS inference eqiad port 30443/tcp - Inference ML service IPv4 on inference.svc.eqiad.wmnet is OK: TCP OK - 0.007 second response time on inference.discovery.wmnet port 30443 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:34:41] \o/ [17:34:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [17:35:27] elukey: uuuh, new cert? 
:) [17:35:56] jayme: I added the discovery endpoint (finally :) [17:36:17] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) So we think this may work, and we've ordered 2 hosts via T297151 for use and testing. [17:36:25] elukey: ah, I thought you had that for quite some time already [17:39:30] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Marostegui) Thanks Chris - I will try a reimage on Monday to see if it PXE boots fine. I have started mysql now so it can start catching up [17:39:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [17:43:03] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2021.codfw.wmnet with OS buster [17:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2021.codfw.wmnet [17:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:10] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2022.codfw.wmnet with OS buster [17:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:22] (03PS1) 10Ppchelko: Temp settings benchmarking entrypoint enhancements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755741 [17:45:51] RECOVERY - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is OK: TCP OK - 0.009 second response time on inference.discovery.wmnet port 30443 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:47:19] (03PS2) 10Ppchelko: Temp settings benchmarking entrypoint enhancements [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/755741 [17:47:43] (03PS4) 10Hashar: ci: set Docker partition size explicitly [puppet] - 10https://gerrit.wikimedia.org/r/755713 [17:48:43] (03PS4) 10KartikMistry: Deploy Flores MT [deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584) [17:49:06] I am rebalancing partitions on the CI agent https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Dpuppet%2Ddocker%2D1002/ [17:49:16] patches to operations/puppet will be a bit delayed [17:54:30] (03CR) 10Herron: [C: 03+1] "LGTM overall, please see comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/754520 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [17:55:32] (03CR) 10Herron: [C: 03+1] P:rsyslog: add squid to the list of programs sent to logstash [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [17:55:35] CI agent is back online [17:57:30] (03PS2) 10Dzahn: Revert "Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb"" [puppet] - 10https://gerrit.wikimedia.org/r/755681 (https://phabricator.wikimedia.org/T281538) [17:57:55] (03PS5) 10Hashar: ci: set Docker partition size explicitly [puppet] - 10https://gerrit.wikimedia.org/r/755713 (https://phabricator.wikimedia.org/T292729) [17:59:30] (03CR) 10Herron: [C: 03+1] Add prometheus[12]00[56] to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/755708 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [17:59:32] (03CR) 10Hashar: [C: 03+1] "Attached to T292729 which is the real reason for this puppet change: raise /srv disk space from 18G to now 37G." 
[puppet] - 10https://gerrit.wikimedia.org/r/755713 (https://phabricator.wikimedia.org/T292729) (owner: 10Hashar) [18:00:00] (03CR) 10Herron: [C: 03+1] hieradata: add host-specific Prometheus data [puppet] - 10https://gerrit.wikimedia.org/r/755711 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [18:00:05] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1800). [18:04:18] (03CR) 10Herron: [C: 03+1] thanos: move to a single flag to control uploads [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [18:04:45] (03CR) 10Herron: [C: 03+1] prepare for logstash 7.16.3 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/755041 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [18:05:04] (03CR) 10Herron: [C: 03+1] bump patch version to update plugins [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/755033 (owner: 10Cwhite) [18:08:15] (03PS1) 10Majavah: Do not try to make watchlist collapsible on wikis where watchlist is disabled [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755682 (https://phabricator.wikimedia.org/T299671) [18:08:41] jouncebot: nowandnext [18:08:41] For the next 0 hour(s) and 51 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1800) [18:08:41] In 0 hour(s) and 51 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1900) [18:08:59] I'm boldly going to deploy that Vector backport [18:09:16] (03CR) 10Majavah: [C: 03+2] Do not try to make watchlist collapsible on wikis where watchlist is disabled [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755682 (https://phabricator.wikimedia.org/T299671) (owner: 10Majavah) 
[18:10:55] (03CR) 10Cicalese: [C: 03+1] Temp settings benchmarking entrypoint enhancements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755741 (owner: 10Ppchelko) [18:13:38] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb"" [puppet] - 10https://gerrit.wikimedia.org/r/755681 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:17:11] !log running puppet on cp403* [18:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:20] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:42] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2022.codfw.wmnet with OS buster [18:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:56] (03PS1) 10Clare Ming: Disable language alert for pilot wikis except thwiki, viwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) [18:23:05] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2022.codfw.wmnet [18:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:35] (03CR) 10Ppchelko: [C: 03+2] Temp settings benchmarking entrypoint enhancements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755741 (owner: 10Ppchelko) [18:24:16] (03Merged) 10jenkins-bot: Temp settings benchmarking entrypoint enhancements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755741 (owner: 10Ppchelko) [18:25:23] (03CR) 10jerkins-bot: [V: 04-1] Do not try to make watchlist collapsible on wikis where watchlist is disabled [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755682 (https://phabricator.wikimedia.org/T299671) (owner: 10Majavah) [18:25:35] :// [18:26:24] (03Merged) 10jenkins-bot: Do not try to make watchlist collapsible on wikis where watchlist is disabled [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755682 (https://phabricator.wikimedia.org/T299671) (owner: 10Majavah) [18:26:51] Pchelolo: I am already deploying, can you wait a bit with your config patch? [18:27:01] taavi: [18:27:06] oh damn, sorry [18:27:34] !log ppchelko@deploy1002 Synchronized w/tmp_settings_bench.php: Config: gerrit 755741 enhancements for the settings benchmark entrypoint (duration: 00m 51s) [18:27:35] it already finished [18:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:52] ah, continuing with my backport then [18:27:59] not touching anything else anymore. 
[18:28:13] thanks [18:29:54] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.18/skins/Vector/includes/Hooks.php: Backport: [[gerrit:755682|Do not try to make watchlist collapsible on wikis where watchlist is disabled (T299671)]] (duration: 00m 50s) [18:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:59] T299671: Loginwiki fatals (TypeError: Argument 1 passed to Vector\Hooks::makeMenuItemCollapsible() must be of the type array, null given, called in /srv/mediawiki/php-1.38.0-wmf.18/skins/Vector/includes/Hooks.php on line 226) - https://phabricator.wikimedia.org/T299671 [18:30:10] * taavi done [18:31:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:30] (03PS2) 10Clare Ming: Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) [18:32:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:32:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:40] (03CR) 10Andrew Bogott: [C: 03+2] Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [18:32:48] (03PS9) 10Andrew Bogott: Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [18:33:03] (03PS1) 10Dzahn: add a foot note to the index.html that this is now a Kubernetes service [container/miscweb] - 
10https://gerrit.wikimedia.org/r/755748 (https://phabricator.wikimedia.org/T281538) [18:33:32] (03CR) 10Dzahn: [C: 03+2] add a foot note to the index.html that this is now a Kubernetes service [container/miscweb] - 10https://gerrit.wikimedia.org/r/755748 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:33:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:35] (03Merged) 10jenkins-bot: add a foot note to the index.html that this is now a Kubernetes service [container/miscweb] - 10https://gerrit.wikimedia.org/r/755748 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:38:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:40:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:00] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:48:25] (03PS1) 10EJoseph: Upgrade to elasticsearh 6.8.23 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499) [18:50:10] RECOVERY - Router 
interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:53] (03CR) 10Jdlrobson: [C: 03+1] Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) (owner: 10Clare Ming) [18:51:40] (03PS1) 10Dzahn: miscweb: bump version to 2022-01-20-183807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/755751 (https://phabricator.wikimedia.org/T281538) [18:51:52] (03CR) 10Nray: [C: 03+1] Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) (owner: 10Clare Ming) [18:52:27] (03CR) 10Dzahn: [C: 03+2] miscweb: bump version to 2022-01-20-183807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/755751 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:52:58] dear train conductor (jeena?): i think https://phabricator.wikimedia.org/T299583 doesn't block the train, unless the log spam is too much [18:53:06] i commented there [18:54:07] Thanks MatmaRex ! [18:55:59] 10SRE, 10Foundational Technology Requests, 10Traffic, 10Wikimedia Enterprise, 10Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (10RBrounley_WMF) 05In progress→03Resolved [18:56:01] (03Merged) 10jenkins-bot: miscweb: bump version to 2022-01-20-183807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/755751 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [19:00:04] RoanKattouw and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1900). 
[19:00:04] Juan_90264 and cjming: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:17] o/ [19:01:18] hello! [19:01:25] cjming: hi, want to deploy today? [19:01:58] sure [19:02:16] go ahead then :) [19:02:29] urbanecm: would you do the 1st one if no one else has approved? [19:02:40] meaning is it ok to go for it? [19:02:49] cjming: Juan's not around, so it should be skipped [19:03:05] cool - onward then [19:03:13] (03CR) 10Clare Ming: [C: 03+2] Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) (owner: 10Clare Ming) [19:04:04] (03Merged) 10jenkins-bot: Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) (owner: 10Clare Ming) [19:06:01] lgtm - syncing [19:06:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:18] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755745|Disable language alert for pilot wikis except thwiki, viwiki. (T295555)]] (duration: 00m 51s) [19:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:21] T295555: Language switching: put an alert in the sidebar about where the language links are - https://phabricator.wikimedia.org/T295555 [19:07:59] alrighty - my change is live -- shall I close this B&C window then? 
[19:08:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:08:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:44] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply on main
[19:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:13] (PS1) Andrew Bogott: Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] - https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546)
[19:09:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:07] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: sync on main
[19:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:36] !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply on main
[19:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:03] !log end of UTC evening backport & config window
[19:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:04] !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: sync on main
[19:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:23] !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply on main
[19:14:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:27] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:04] (PS1) Volans: spicerack: allow to execute another cookbook [software/spicerack] - https://gerrit.wikimedia.org/r/755756
[19:15:14] (PS2) Andrew Bogott: Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] - https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546)
[19:15:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:15:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:26] (PS3) Andrew Bogott: Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] - https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546)
[19:17:29] !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: sync on main
[19:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:30] !log rebooting mx1001 to test new kernel
[19:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:11] (PS4) Andrew Bogott: Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] - https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546)
[19:22:12] (CR) jerkins-bot: [V: -1] spicerack: allow to execute another cookbook [software/spicerack] - https://gerrit.wikimedia.org/r/755756 (owner: Volans)
[19:23:18] (CR) Andrew Bogott: [C: +2] Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] -
https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[19:30:33] jouncebot: now
[19:30:33] For the next 0 hour(s) and 29 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1900)
[19:31:30] (PS1) Andrew Bogott: Provide cinder backup node list to rabbitmq in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/755759 (https://phabricator.wikimedia.org/T292546)
[19:33:48] (PS2) BryanDavis: wikitech: Remove password clear on block [mediawiki-config] - https://gerrit.wikimedia.org/r/752185
[19:34:14] SRE, Patch-For-Review, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn) Open→Resolved This is resolved! :) Proof is the footnote in https://static-bugzilla.wikimedia.org/ that is only shown when served from k8s. {F34924843}
[19:34:19] (CR) Andrew Bogott: [C: +2] Provide cinder backup node list to rabbitmq in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/755759 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[19:35:08] SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (FGoodwin)
[19:35:25] (CR) BryanDavis: [C: +2] wikitech: Remove password clear on block [mediawiki-config] - https://gerrit.wikimedia.org/r/752185 (owner: BryanDavis)
[19:36:27] (Merged) jenkins-bot: wikitech: Remove password clear on block [mediawiki-config] - https://gerrit.wikimedia.org/r/752185 (owner: BryanDavis)
[19:38:27] !log bd808@deploy1002 Synchronized wmf-config/wikitech.php: wikitech: Remove password clear on block (duration: 00m 50s)
[19:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:02] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 83, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:40:28] (CR) Volans: "CI Failures are due to the latest dnspython 2.2.0 release 2 days ago." [software/spicerack] - https://gerrit.wikimedia.org/r/755756 (owner: Volans)
[19:42:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:43:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:21] (PS2) Aaron Schulz: Simplify comments and stubs for etcd-defined DB config [mediawiki-config] - https://gerrit.wikimedia.org/r/752212
[19:49:00] (PS1) Dzahn: delete bugzilla_static after it moved from puppet to k8s [puppet] - https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538)
[20:00:04] jeena and twentyafterfour: May I have your attention please! MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T2000)
[20:02:58] MatmaRex: If https://phabricator.wikimedia.org/T299583 is resolved I'd like to backport it before doing the train
[20:09:32] jeena: yes, please do
[20:10:02] jeena: i'm afk for a moment, i'll be back in 30 minutes, but i don't think you'll need me for this?
[20:10:14] thanks and sorry about the bug :)
[20:10:15] are there 3 patches I need to backport?
[20:10:17] or just the one?
[20:10:36] just one, i think?
[20:10:54] which would be the other ones?
[20:11:11] ah okay, I thought there were more from the comments on your commit message
[20:11:17] thanks!
[20:11:19] oh, the two i mentioned there are already in wmf.18
[20:11:25] okay cool
[20:11:27] and they're the cause of this bug
[20:11:28] :D
[20:11:33] haha
[20:11:47] brb
[20:13:51] (PS1) Jeena Huneidi: Prevent assertion failure caused by empty headings [extensions/DiscussionTools] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755684 (https://phabricator.wikimedia.org/T299583)
[20:16:23] (PS1) RLazarus: Initial deb package [software/httpbb] - https://gerrit.wikimedia.org/r/755764
[20:17:33] (CR) jerkins-bot: [V: -1] Initial deb package [software/httpbb] - https://gerrit.wikimedia.org/r/755764 (owner: RLazarus)
[20:19:12] (CR) Jeena Huneidi: [C: +2] Prevent assertion failure caused by empty headings [extensions/DiscussionTools] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755684 (https://phabricator.wikimedia.org/T299583) (owner: Jeena Huneidi)
[20:19:21] (CR) Jeena Huneidi: [C: +2] "backport" [extensions/DiscussionTools] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755684 (https://phabricator.wikimedia.org/T299583) (owner: Jeena Huneidi)
[20:24:01] (Merged) jenkins-bot: Prevent assertion failure caused by empty headings [extensions/DiscussionTools] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755684 (https://phabricator.wikimedia.org/T299583) (owner: Jeena Huneidi)
[20:25:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[20:26:40] (CR) Umherirrender: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/755767 (https://phabricator.wikimedia.org/T282308) (owner: Umherirrender)
[20:30:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:31:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:27] (CR) BBlack: [C: +1] "LGTM in functional terms. Probably needs some confirmation from data eng that they're ready to have the new data appear in the webrequest" [puppet] - https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: Phuedx)
[20:31:31] !log jhuneidi@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/DiscussionTools/includes/HeadingItem.php: Backport: [[gerrit:755684|Prevent assertion failure caused by empty headings (T299583)]] (duration: 00m 50s)
[20:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:34] T299583: Wikimedia\Assert\PreconditionException: Precondition failed: Range is not collapsed - https://phabricator.wikimedia.org/T299583
[20:32:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:56] (PS1) Jeena Huneidi: all wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - https://gerrit.wikimedia.org/r/755787
[20:33:58] (CR) Jeena Huneidi: [C: +2] all wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - https://gerrit.wikimedia.org/r/755787 (owner: Jeena Huneidi)
[20:34:38] !log
cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1008.eqiad.wmnet with OS buster
[20:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:46] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster
[20:34:55] (Merged) jenkins-bot: all wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - https://gerrit.wikimedia.org/r/755787 (owner: Jeena Huneidi)
[20:35:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[20:36:10] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.18 refs T293959
[20:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:13] T293959: 1.38.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T293959
[20:37:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:45] !log upgrading Cassandra to 3.11.11, aqs1010 -- T298516
[20:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:49] T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516
[20:38:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:38:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:01] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:11] SRE, SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (herron)
[20:40:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:40:13] SRE, SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (herron)
[20:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:34] (PS3) Herron: remove elk5 related LVS services [puppet] - https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T299700)
[20:41:25] thanks for backporting
[20:41:31] (PS1) Andrew Bogott: ceph: list cloudbackup2002 as a cinder backup node [puppet] - https://gerrit.wikimedia.org/r/755788 (https://phabricator.wikimedia.org/T292546)
[20:41:41] thanks for the fix :)
[20:42:05] (CR) jerkins-bot: [V: -1] ceph: list cloudbackup2002 as a cinder backup node [puppet] - https://gerrit.wikimedia.org/r/755788 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[20:44:14] (PS2) Andrew Bogott: ceph: list cloudbackup2002 as a cinder backup node [puppet] - https://gerrit.wikimedia.org/r/755788 (https://phabricator.wikimedia.org/T292546)
[20:45:34] (CR) Andrew Bogott: [C: +2] ceph: list cloudbackup2002 as a cinder backup node [puppet] - https://gerrit.wikimedia.org/r/755788 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[20:48:04] (PS1) Herron: switch legacy elk LVS entries to state: lvs_setup [puppet] - https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700)
[20:49:04] SRE, Patch-For-Review, SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (herron)
[20:50:59] (PS1) Herron: remove kibana.discovery.wmnet record [dns] -
https://gerrit.wikimedia.org/r/755790 (https://phabricator.wikimedia.org/T299700)
[20:51:24] SRE, Patch-For-Review, SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (herron)
[20:54:40] (PS1) Eigyan: [wmf-config]: Deploy fawiki test survey to beta [mediawiki-config] - https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628)
[20:58:03] SRE, Traffic, Patch-For-Review, SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (herron)
[21:00:04] SRE, serviceops: Debian package for httpbb - https://phabricator.wikimedia.org/T299705 (RLazarus) p:Triage→Medium
[21:00:38] SRE, serviceops: Debian package for httpbb - https://phabricator.wikimedia.org/T299705 (RLazarus)
[21:00:43] SRE, Wikimedia-Apache-configuration, serviceops: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (RLazarus)
[21:01:01] (PS2) RLazarus: Initial deb package [software/httpbb] - https://gerrit.wikimedia.org/r/755764 (https://phabricator.wikimedia.org/T299705)
[21:01:41] (PS2) Eigyan: [wmf-config]: Deploy fawiki test survey to beta [mediawiki-config] - https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628)
[21:02:07] (CR) jerkins-bot: [V: -1] Initial deb package [software/httpbb] - https://gerrit.wikimedia.org/r/755764 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[21:04:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1008.eqiad.wmnet with OS buster
[21:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:38] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by
cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster executed with errors: -...
[21:06:23] (PS1) Addshore: Add mwcli.command_execute to wgEventStreams [mediawiki-config] - https://gerrit.wikimedia.org/r/755794 (https://phabricator.wikimedia.org/T293583)
[21:09:32] (PS3) RLazarus: Initial deb package [software/httpbb] - https://gerrit.wikimedia.org/r/755764 (https://phabricator.wikimedia.org/T299705)
[21:09:34] (PS1) RLazarus: tox: Run mypy only in the source directory and exclude .eggs from flake8 [software/httpbb] - https://gerrit.wikimedia.org/r/755796
[21:28:39] (CR) Ebernhardson: [C: +1] "verify_commit is happy, builds a deb. Installed elasticsearch-oss 6.8.23 along with this package to cirrus-integ02,loads up happy enough." [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499) (owner: EJoseph)
[21:31:01] (CR) Ottomata: [C: +1] Add mwcli.command_execute to wgEventStreams [mediawiki-config] - https://gerrit.wikimedia.org/r/755794 (https://phabricator.wikimedia.org/T293583) (owner: Addshore)
[21:45:51] (CR) Ebernhardson: [C: +1] Upgrade to elasticsearh 6.8.23 (2 comments) [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499) (owner: EJoseph)
[21:50:14] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:50:43] (PS1) Eevans: Pin Cassandra 3.11.11 as 'dev' [puppet] - https://gerrit.wikimedia.org/r/755800 (https://phabricator.wikimedia.org/T298516)
[21:53:32] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:56:40] (CR) Eevans: [C: +1] "PCC output:
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33372/console" [puppet] - https://gerrit.wikimedia.org/r/755800 (https://phabricator.wikimedia.org/T298516) (owner: Eevans)
[21:59:47] Puppet is in WARN on aqs1010, if there is anyone around that can +2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/755800, we could resolve that
[22:00:08] So...is there? :)
[22:05:15] o/
[22:05:27] (CR) Cwhite: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/755800 (https://phabricator.wikimedia.org/T298516) (owner: Eevans)
[22:06:00] urandom: done
[22:06:18] cwhite: awesome; thanks!
[22:15:03] (PS1) Ryan Kemper: wcqs: add discovery record [dns] - https://gerrit.wikimedia.org/r/755806 (https://phabricator.wikimedia.org/T282117)
[22:15:29] (CR) Cwhite: [C: +2] logstash: ensure dlq directory exists [puppet] - https://gerrit.wikimedia.org/r/753571 (owner: Cwhite)
[22:17:11] (CR) Bking: [V: +1] wcqs: add discovery record [dns] - https://gerrit.wikimedia.org/r/755806 (https://phabricator.wikimedia.org/T282117) (owner: Ryan Kemper)
[22:17:22] (CR) Cwhite: [V: +2 C: +2] bump patch version to update plugins [software/logstash/plugins] - https://gerrit.wikimedia.org/r/755033 (owner: Cwhite)
[22:22:36] (CR) Cwhite: [C: +2] builder: add opensearch1 pbuilder hooks for logstash-plugins update [puppet] - https://gerrit.wikimedia.org/r/755043 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite)
[22:26:30] (CR) Ryan Kemper: "We will merge this when we're ready to go from monitoring_setup to production.
Currently we're in lvs_setup going into monitoring_setup so" [dns] - https://gerrit.wikimedia.org/r/755806 (https://phabricator.wikimedia.org/T282117) (owner: Ryan Kemper)
[22:27:08] !log rolling restart of Cassandra, aqs-next -- T298516
[22:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:14] T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516
[22:33:13] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (tstarling) >>! In T292322#7636542, @Joe wrote: > But given in reality I was proposing to do something like: > > signature = md5sum( secret + padding + re...
[22:35:22] (PS1) Bking: wcqs: Move back from lvs_setup to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/755810 (https://phabricator.wikimedia.org/T280001)
[22:36:19] (CR) Ryan Kemper: [C: +1] wcqs: Move back from lvs_setup to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/755810 (https://phabricator.wikimedia.org/T280001) (owner: Bking)
[22:36:41] (CR) Bking: [C: +2] wcqs: Move back from lvs_setup to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/755810 (https://phabricator.wikimedia.org/T280001) (owner: Bking)
[22:38:36] !log running puppet-merge for ^^
[22:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:20] !log running puppet-merge for https://gerrit.wikimedia.org/r/755810
[22:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:01] (PS1) Cwhite: logstash: install logstash-plugins on logging logstash clusters [puppet] - https://gerrit.wikimedia.org/r/755811 (https://phabricator.wikimedia.org/T299168)
[22:51:32] (PS1) Cwhite: logstash: switch to opensearch output plugin on production logstash [puppet] - https://gerrit.wikimedia.org/r/755812
(https://phabricator.wikimedia.org/T299168)
[22:53:27] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:57:35] (CR) Jforrester: [C: +1] Undeploy UserMerge (1) [mediawiki-config] - https://gerrit.wikimedia.org/r/755532 (https://phabricator.wikimedia.org/T216089) (owner: Majavah)
[23:05:07] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Jclark-ctr)
[23:51:29] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:58:11] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 875942 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops