[00:00:05] <jouncebot>	 RoanKattouw and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T0000).
[00:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[00:11:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install labstore100[89] - https://phabricator.wikimedia.org/T299610 (Andrew) Rack and network looks right to me. We might be renaming these hosts but I'll get the task retitled before the servers show up.
[00:27:00] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_network_flows_internal-sanitization_daily.service,eventlogging_to_druid_network_flows_internal_daily.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:34] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:32:44] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7288 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[00:47:20] <icinga-wm>	 PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service,curator_actions_apifeatureusage_eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:14] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 5.671e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:53:40] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: miscweb1002, labstore1006, labstore1007, build2001, wdqs1010 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[01:00:05] <jouncebot>	 twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T0100).
[01:12:06] <wikibugs>	 (CR) Tim Starling: [C: +1] "Should work, approved for self-merge" [mediawiki-config] - https://gerrit.wikimedia.org/r/752212 (owner: Aaron Schulz)
[01:30:15] <wikibugs>	 (PS3) Juan90264: Create Draft namespace for bgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/755413
[01:31:26] <wikibugs>	 (PS4) Juan90264: Create Draft namespace for bgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/755413 (https://phabricator.wikimedia.org/T299224)
[01:36:27] <wikibugs>	 (CR) Cwhite: [C: +1] prometheus: handle non-LVS service::catalog entries [puppet] - https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[01:36:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:39:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:46:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:48:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:53:46] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.067e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:03:46] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7406 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[02:27:36] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7175 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[02:29:56] <icinga-wm>	 RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11170 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[03:00:31] <wikibugs>	 (CR) Krinkle: Benchmark loading DefaultSettings from YAML (1 comment) [core] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/754911 (owner: Ppchelko)
[03:19:22] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: wdqs1010, miscweb1002, build2001, labstore1006, labstore1007 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[03:32:38] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:00:12] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7377 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[04:07:18] <icinga-wm>	 RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11118 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[04:15:12] <icinga-wm>	 PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:52:30] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, labstore1006, wdqs1010, labstore1007, miscweb1002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:09:29] <wikibugs>	 (PS1) Marostegui: Revert "db2129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755418
[06:10:56] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Disable notifications on a few s6 hosts" [puppet] - https://gerrit.wikimedia.org/r/755419
[06:11:25] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755418 (owner: Marostegui)
[06:11:39] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "mariadb: Disable notifications on a few s6 hosts" [puppet] - https://gerrit.wikimedia.org/r/755419 (owner: Marostegui)
[06:14:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:14:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18896 and previous config saved to /var/cache/conftool/dbconfig/20220120-061407-marostegui.json
[06:14:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:11] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[06:15:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1180 T299479', diff saved to https://phabricator.wikimedia.org/P18897 and previous config saved to /var/cache/conftool/dbconfig/20220120-061529-marostegui.json
[06:15:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:33] <stashbot>	 T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479
[06:15:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18898 and previous config saved to /var/cache/conftool/dbconfig/20220120-061538-marostegui.json
[06:15:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:43] <wikibugs>	 (PS1) Marostegui: db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755525 (https://phabricator.wikimedia.org/T299479)
[06:17:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1180.eqiad.wmnet with OS bullseye
[06:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:29] <wikibugs>	 (CR) Marostegui: [C: +2] db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755525 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui)
[06:30:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P18899 and previous config saved to /var/cache/conftool/dbconfig/20220120-063042-marostegui.json
[06:30:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:17] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:45:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P18900 and previous config saved to /var/cache/conftool/dbconfig/20220120-064547-marostegui.json
[06:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1180.eqiad.wmnet with OS bullseye
[06:47:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:27] <wikibugs>	 (PS1) Marostegui: Revert "db1180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755421
[06:51:31] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755421 (owner: Marostegui)
[06:54:48] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=get https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[06:55:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18901 and previous config saved to /var/cache/conftool/dbconfig/20220120-065551-root.json
[06:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18902 and previous config saved to /var/cache/conftool/dbconfig/20220120-070052-marostegui.json
[07:00:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:00:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:00:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:57] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[07:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:01:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:01:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[07:01:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[07:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18903 and previous config saved to /var/cache/conftool/dbconfig/20220120-070119-marostegui.json
[07:01:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18904 and previous config saved to /var/cache/conftool/dbconfig/20220120-070231-marostegui.json
[07:02:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:16] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:07:16] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:10:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18905 and previous config saved to /var/cache/conftool/dbconfig/20220120-071054-root.json
[07:10:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:40] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:17:06] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:17:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18906 and previous config saved to /var/cache/conftool/dbconfig/20220120-071736-marostegui.json
[07:17:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:04] <icinga-wm>	 RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:22:07] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1128 to m1 [puppet] - https://gerrit.wikimedia.org/r/755526 (https://phabricator.wikimedia.org/T299344)
[07:23:13] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1128 to m1 [puppet] - https://gerrit.wikimedia.org/r/755526 (https://phabricator.wikimedia.org/T299344) (owner: Marostegui)
[07:24:58] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:26:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18907 and previous config saved to /var/cache/conftool/dbconfig/20220120-072558-root.json
[07:26:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:08] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:28:50] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:30:11] <wikibugs>	 SRE, envoy, serviceops: The TLS proxy configuration in deployment-charts allows invalid listeners - https://phabricator.wikimedia.org/T291959 (Joe) a:Joe
[07:30:18] <wikibugs>	 (PS1) Giuseppe Lavagetto: _tls_helpers: fail if a listener is non existent [deployment-charts] - https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959)
[07:31:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] _tls_helpers: fail if a listener is non existent [deployment-charts] - https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959) (owner: Giuseppe Lavagetto)
[07:31:56] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:32:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1128.eqiad.wmnet with OS bullseye
[07:32:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18908 and previous config saved to /var/cache/conftool/dbconfig/20220120-073241-marostegui.json
[07:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:52] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:41:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18909 and previous config saved to /var/cache/conftool/dbconfig/20220120-074105-root.json
[07:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18910 and previous config saved to /var/cache/conftool/dbconfig/20220120-074746-marostegui.json
[07:47:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:47:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:47:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:52] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[07:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T285149)', diff saved to https://phabricator.wikimedia.org/P18911 and previous config saved to /var/cache/conftool/dbconfig/20220120-074753-marostegui.json
[07:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:57] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar), User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (MMandere) Open→Resolved a:MMandere We now have varnish upgraded from `6.0.8` to `6.0.9` in all our cache instances (across all datacent...
[07:47:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T285149)', diff saved to https://phabricator.wikimedia.org/P18912 and previous config saved to /var/cache/conftool/dbconfig/20220120-075005-marostegui.json
[07:50:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:06] <wikibugs>	 (PS2) Muehlenhoff: Make ganeti1024 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/755440 (https://phabricator.wikimedia.org/T283036)
[07:56:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18913 and previous config saved to /var/cache/conftool/dbconfig/20220120-075609-root.json
[07:56:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:14] <marostegui>	 !log Stop mysql on db1117 to clone db1128 T299344
[07:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:17] <stashbot>	 T299344: Upgrade m1 to Bullseye - https://phabricator.wikimedia.org/T299344
[07:57:40] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Make ganeti1024 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/755440 (https://phabricator.wikimedia.org/T283036) (owner: Muehlenhoff)
[07:59:29] <wikibugs>	 (PS1) Giuseppe Lavagetto: httpbb: remove tests that fail under k8s [puppet] - https://gerrit.wikimedia.org/r/755529 (https://phabricator.wikimedia.org/T285298)
[07:59:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1128.eqiad.wmnet with OS bullseye
[07:59:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:00] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:02:26] <wikibugs>	 (PS1) Marostegui: install_server: Do not format db1128 [puppet] - https://gerrit.wikimedia.org/r/755530 (https://phabricator.wikimedia.org/T299344)
[08:02:28] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:02:34] <marostegui>	 haproxy alerts are expected
[08:02:46] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:02:48] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:24] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:34] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not format db1128 [puppet] - https://gerrit.wikimedia.org/r/755530 (https://phabricator.wikimedia.org/T299344) (owner: Marostegui)
[08:03:38] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:58] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:58] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:13] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:04:14] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[08:05:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18915 and previous config saved to /var/cache/conftool/dbconfig/20220120-080510-marostegui.json
[08:05:11] <wikibugs>	 (PS1) Majavah: Undeploy UserMerge (1) [mediawiki-config] - https://gerrit.wikimedia.org/r/755532 (https://phabricator.wikimedia.org/T216089)
[08:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:13] <wikibugs>	 (PS1) Majavah: Undeploy UserMerge (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/755533 (https://phabricator.wikimedia.org/T216089)
[08:05:15] <wikibugs>	 (PS1) Majavah: Undeploy UserMerge (3) [mediawiki-config] - https://gerrit.wikimedia.org/r/755534 (https://phabricator.wikimedia.org/T216089)
[08:09:18] <wikibugs>	 SRE, SRE-Access-Requests: Requesting LDAP-only access to analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (Jelto) p:Triage→Medium
[08:10:03] <wikibugs>	 (PS1) Giuseppe Lavagetto: deployment-prep: install php 7.4 everywhere [puppet] - https://gerrit.wikimedia.org/r/755536 (https://phabricator.wikimedia.org/T295578)
[08:11:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18916 and previous config saved to /var/cache/conftool/dbconfig/20220120-081112-root.json
[08:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:49] <wikibugs>	 SRE, SRE-Access-Requests: Requesting LDAP-only access to analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (Jelto) Thanks for the access request. But there is no group named `analytics-private-data`. I assume you mean `analytics-privatedata-users`, is that correct?  If y...
[08:18:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 for on-site maintenance T299123', diff saved to https://phabricator.wikimedia.org/P18917 and previous config saved to /var/cache/conftool/dbconfig/20220120-081809-marostegui.json
[08:18:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:14] <stashbot>	 T299123: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123
[08:19:39] <wikibugs>	 (PS1) Marostegui: es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755630 (https://phabricator.wikimedia.org/T299123)
[08:20:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18918 and previous config saved to /var/cache/conftool/dbconfig/20220120-082015-marostegui.json
[08:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:31] <wikibugs>	 (CR) Marostegui: [C: +2] es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755630 (https://phabricator.wikimedia.org/T299123) (owner: Marostegui)
[08:25:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[08:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18919 and previous config saved to /var/cache/conftool/dbconfig/20220120-082616-root.json
[08:26:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:03] <wikibugs>	 (PS1) Elukey: knative-serving,kserve-inference: move _helpers.tpl to 0.4 [deployment-charts] - https://gerrit.wikimedia.org/r/755638 (https://phabricator.wikimedia.org/T292390)
[08:28:05] <wikibugs>	 (CR) Majavah: "{{ping}}" [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[08:29:10] <wikibugs>	 (Abandoned) Elukey: WIP - kserve-inference: add support for local tls proxy [deployment-charts] - https://gerrit.wikimedia.org/r/741092 (owner: Elukey)
[08:33:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[08:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T285149)', diff saved to https://phabricator.wikimedia.org/P18920 and previous config saved to /var/cache/conftool/dbconfig/20220120-083520-marostegui.json
[08:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:25] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[08:35:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:35:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:35:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance
[08:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance
[08:35:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:39] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-toolforge-tests: fix NFS mount point [puppet] - 10https://gerrit.wikimedia.org/r/755639
[08:35:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:35:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:35:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18921 and previous config saved to /var/cache/conftool/dbconfig/20220120-083558-marostegui.json
[08:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:20] <wikibugs>	 (03CR) 10Majavah: "can't we just use /data/project/automated-toolforge-tests for both projects?" [puppet] - 10https://gerrit.wikimedia.org/r/755639 (owner: 10Arturo Borrero Gonzalez)
[08:37:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18922 and previous config saved to /var/cache/conftool/dbconfig/20220120-083711-marostegui.json
[08:37:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:16] <wikibugs>	 (03PS1) 10Elukey: helmfile.d: remove secrets chart from knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/755640 (https://phabricator.wikimedia.org/T298976)
[08:41:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18923 and previous config saved to /var/cache/conftool/dbconfig/20220120-084120-root.json
[08:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-toolforge-tests: fix NFS mount point (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755639 (owner: 10Arturo Borrero Gonzalez)
[08:44:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] helmfile.d: remove secrets chart from knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/755640 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[08:45:13] <wikibugs>	 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Legoktm) Using a known broken hash like MD5 seems wrong in what's supposed to be a security-sensitive application. Since we are already calculating the SH...
[08:46:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet
[08:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:16] <wikibugs>	 (03PS3) 10Legoktm: P:mw::maintenance: add centralauth group purge job [puppet] - 10https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: 10Majavah)
[08:47:18] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update codfw kubernetes master to a full node [puppet] - 10https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[08:47:22] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-tests: drop leftover hash mention [puppet] - 10https://gerrit.wikimedia.org/r/755642
[08:48:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[08:48:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[08:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:22] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] P:mw::maintenance: add centralauth group purge job [puppet] - 10https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: 10Majavah)
[08:49:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet
[08:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:43] <wikibugs>	 (03PS2) 10Elukey: admin_ng: remove the secrets chart from knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/755441 (https://phabricator.wikimedia.org/T298976)
[08:51:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:51:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:52:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18924 and previous config saved to /var/cache/conftool/dbconfig/20220120-085215-marostegui.json
[08:52:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:52:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:59] <wikibugs>	 (03CR) 10Legoktm: "legoktm@mwmaint1002:~$ systemctl status mediawiki_job_purge_expired_global_rights" [puppet] - 10https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: 10Majavah)
[08:53:13] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:53:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:55:41] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:55:49] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet
[08:55:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:56:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18925 and previous config saved to /var/cache/conftool/dbconfig/20220120-085623-root.json
[08:56:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: remove the secrets chart from knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/755441 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[08:58:01] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet
[08:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:09] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:00:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:22] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:00:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:29] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:00:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:26] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestagemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org
[09:01:27] <wikibugs>	 10SRE, 10Observability-Alerting, 10User-fgiunchedi: Debug / fine tune puppet failed metrics and alerts on alert* hosts - https://phabricator.wikimedia.org/T299628 (10fgiunchedi)
[09:03:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving,kserve-inference: move _helpers.tpl to 0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/755638 (https://phabricator.wikimedia.org/T292390) (owner: 10Elukey)
[09:05:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:05:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:05:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:41] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:52] <jayme>	 that's me
[09:06:26] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org
[09:07:10] <wikibugs>	 10SRE, 10Observability-Alerting, 10User-fgiunchedi: Debug / fine tune puppet failed metrics and alerts on alert* hosts - https://phabricator.wikimedia.org/T299628 (10Majavah) I've noticed that when puppet fails to compile catalog, it won't show as failed but will have 0 resources, which is what happened here...
[09:07:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18926 and previous config saved to /var/cache/conftool/dbconfig/20220120-090720-marostegui.json
[09:07:21] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: handle non-LVS service::catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[09:07:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:08:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:08:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: drop leftover hash mention [puppet] - 10https://gerrit.wikimedia.org/r/755642 (owner: 10Arturo Borrero Gonzalez)
[09:11:02] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10elukey) knative-serving and kserve-inference should be done! :)
[09:11:09] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: wmcs: factorize common arguments [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754473
[09:11:11] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: introduce cookbook to repool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754555 (https://phabricator.wikimedia.org/T298948)
[09:11:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18927 and previous config saved to /var/cache/conftool/dbconfig/20220120-091127-root.json
[09:11:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:19] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:15:55] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:17:59] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: factorize common arguments [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754473 (owner: 10Arturo Borrero Gonzalez)
[09:18:05] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: grid: introduce cookbook to repool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754555 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez)
[09:18:35] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:22:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T285149)', diff saved to https://phabricator.wikimedia.org/P18928 and previous config saved to /var/cache/conftool/dbconfig/20220120-092225-marostegui.json
[09:22:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[09:22:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[09:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:30] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[09:22:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T285149)', diff saved to https://phabricator.wikimedia.org/P18929 and previous config saved to /var/cache/conftool/dbconfig/20220120-092232-marostegui.json
[09:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:26] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org
[09:30:58] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:32:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Update to 6.4.5 and enable webauthn [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644
[09:33:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib::deep_merge: add a deep merge that support arrays [puppet] - 10https://gerrit.wikimedia.org/r/747525 (owner: 10Jbond)
[09:33:56] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: skip tcp module, already in http module [puppet] - 10https://gerrit.wikimedia.org/r/755645 (https://phabricator.wikimedia.org/T291946)
[09:34:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: skip tcp module, already in http module [puppet] - 10https://gerrit.wikimedia.org/r/755645 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[09:36:46] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:36:56] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:37:02] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:38:25] <wikibugs>	 (03PS1) 10Jbond: pcc: make positionals optional [puppet] - 10https://gerrit.wikimedia.org/r/755646
[09:38:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33353/console" [puppet] - 10https://gerrit.wikimedia.org/r/755403 (owner: 10Jbond)
[09:39:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:docker::reporter: make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/755403 (owner: 10Jbond)
[09:39:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pcc: make positionals optional [puppet] - 10https://gerrit.wikimedia.org/r/755646 (owner: 10Jbond)
[09:42:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) (owner: 10Dzahn)
[09:49:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1018.eqiad.wmnet with OS buster
[09:49:01] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ganeti1018.eqiad.wmnet with OS buster
[09:49:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1018.eqiad.wmnet with OS buster
[09:50:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: Allow configuring delayed_closed_timeout [puppet] - 10https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[09:50:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS buster
[09:50:54] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] envoy: Allow configuring delayed_closed_timeout [puppet] - 10https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[09:53:39] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] cache::envoy: Set the delayed_close_timeout to 20s [puppet] - 10https://gerrit.wikimedia.org/r/755340 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[09:54:16] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "should also update the version of Gradle in /gradle/wrapper/gradle-wrapper.properties" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644 (owner: 10Muehlenhoff)
[09:55:27] <wikibugs>	 (03PS3) 10Jbond: P:rsyslog: add squid to the list of programs sent to logstash [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087)
[09:56:12] <wikibugs>	 (03PS4) 10Jbond: P:rsyslog: add squid to the list of programs sent to logstash [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087)
[09:56:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T285149)', diff saved to https://phabricator.wikimedia.org/P18930 and previous config saved to /var/cache/conftool/dbconfig/20220120-095652-marostegui.json
[09:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:57] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[09:57:24] <wikibugs>	 (03CR) 10Jbond: P:rsyslog: add squid to the list of programs sent to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond)
[09:57:50] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:59:00] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[10:02:14] <_joe_>	 uhm
[10:02:59] <_joe_>	 just a delete that took more than 100 ms
[10:03:08] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:05:28] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:07:25] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[10:11:26] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubestagemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org
[10:11:45] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Deploy the dev version of cassandra to aqs1010.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis)
[10:11:57] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:11:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18931 and previous config saved to /var/cache/conftool/dbconfig/20220120-101157-marostegui.json
[10:11:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:37] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:05] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: wmcs: toolforge: introduce cookbook to run tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754944 (https://phabricator.wikimedia.org/T298948)
[10:16:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: __init__: run black -l120 [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755647
[10:16:09] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: refactor cmd-checklist-runner operations [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755648
[10:17:35] <wikibugs>	 (03PS1) 10Jbond: O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650
[10:18:03] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:18:21] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:18:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond)
[10:19:17] <wikibugs>	 (03PS1) 10Elukey: role::pki::root: add the ml_serve intermediate PKI [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976)
[10:19:27] <elukey>	 ah snap bad timing :)
[10:20:54] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: __init__: run black -l120 [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755647 (owner: 10Arturo Borrero Gonzalez)
[10:21:15] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:21:30] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: refactor cmd-checklist-runner operations [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755648 (owner: 10Arturo Borrero Gonzalez)
[10:21:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: introduce cookbook to run tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754944 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez)
[10:22:58] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Deploy the dev version of cassandra to aqs1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis)
[10:26:09] <icinga-wm>	 PROBLEM - Disk space on kubestagemaster2001 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/e5cb0fdd-6df9-42f5-8a50-01bff58133e0/volumes/kubernetes.io~secret/calico-node-token-5fmsz is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubestagemaster2001&var-datasource=codfw+prometheus/ops
[10:27:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18932 and previous config saved to /var/cache/conftool/dbconfig/20220120-102702-marostegui.json
[10:27:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:47] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: tests: use sudo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755653
[10:32:21] <wikibugs>	 (03PS2) 10Jbond: O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650
[10:33:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33355/console" [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond)
[10:33:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond)
[10:33:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: tests: use sudo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755653 (owner: 10Arturo Borrero Gonzalez)
[10:34:11] <wikibugs>	 (03PS1) 10Ayounsi: Bump Atlas exporter scrape_timeout from 10 to 30s [puppet] - 10https://gerrit.wikimedia.org/r/755654 (https://phabricator.wikimedia.org/T251156)
[10:35:53] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add kubestagemaster2001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[10:36:33] <wikibugs>	 (03Merged) 10jenkins-bot: Add kubestagemaster2001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[10:38:26] <wikibugs>	 (03PS1) 10Btullis: Stop writing parquet logs to files [puppet] - 10https://gerrit.wikimedia.org/r/755655 (https://phabricator.wikimedia.org/T297734)
[10:41:31] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: add Host header support to probes [puppet] - 10https://gerrit.wikimedia.org/r/755656 (https://phabricator.wikimedia.org/T291946)
[10:42:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T285149)', diff saved to https://phabricator.wikimedia.org/P18933 and previous config saved to /var/cache/conftool/dbconfig/20220120-104206-marostegui.json
[10:42:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:42:11] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[10:42:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:42:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T285149)', diff saved to https://phabricator.wikimedia.org/P18934 and previous config saved to /var/cache/conftool/dbconfig/20220120-104220-marostegui.json
[10:42:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:39] <wikibugs>	 (03PS2) 10Btullis: Stop writing parquet logs to files [puppet] - 10https://gerrit.wikimedia.org/r/755655 (https://phabricator.wikimedia.org/T297734)
[10:42:45] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[10:43:27] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33357/console" [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[10:43:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T285149)', diff saved to https://phabricator.wikimedia.org/P18935 and previous config saved to /var/cache/conftool/dbconfig/20220120-104332-marostegui.json
[10:43:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:33] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[10:45:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1018.eqiad.wmnet with OS buster
[10:45:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:28] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add tls port for cloud vps rabbitmq [homer/public] - 10https://gerrit.wikimedia.org/r/755478 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah)
[10:45:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS buster completed: - ganeti1018 (**PASS**)...
[10:45:51] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:46:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33356/console" [puppet] - 10https://gerrit.wikimedia.org/r/755656 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[10:47:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: add Host header support to probes [puppet] - 10https://gerrit.wikimedia.org/r/755656 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[10:49:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Update automatic Icinga LLDP hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/755342 (owner: 10Ayounsi)
[10:50:11] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:52:20] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided)
[10:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:28] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s)
[10:52:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Bump Atlas exporter scrape_timeout from 10 to 30s [puppet] - 10https://gerrit.wikimedia.org/r/755654 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi)
[10:53:27] <wikibugs>	 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10TheDJ) I've removed graphoid info from https://www.mediawiki.org/wiki/Extension:Graph to avoid further confusion for read...
[10:55:51] <wikibugs>	 (03CR) 10Elukey: "Hi folks! Any plan for the deployment?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan)
[10:55:59] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[10:56:39] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Bump Atlas exporter scrape_timeout from 10 to 30s [puppet] - 10https://gerrit.wikimedia.org/r/755654 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi)
[10:58:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18936 and previous config saved to /var/cache/conftool/dbconfig/20220120-105837-marostegui.json
[10:58:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:53] <wikibugs>	 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) >>! In T299527#7633551, @Cmjohnson wrote: > I updated the firmware on 1018   Thanks, with the updated firmware I was able to reim...
[11:00:05] <jouncebot>	 mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1100)
[11:06:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Stop writing parquet logs to files [puppet] - 10https://gerrit.wikimedia.org/r/755655 (https://phabricator.wikimedia.org/T297734) (owner: 10Btullis)
[11:06:54] <wikibugs>	 (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/754957 (owner: 10PipelineBot)
[11:09:44] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: probe with http host override [puppet] - 10https://gerrit.wikimedia.org/r/755657 (https://phabricator.wikimedia.org/T291946)
[11:09:47] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: add probes for non-lvs services [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946)
[11:10:46] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/754957 (owner: 10PipelineBot)
[11:13:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18937 and previous config saved to /var/cache/conftool/dbconfig/20220120-111341-marostegui.json
[11:13:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:54] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging
[11:13:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:56] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production
[11:13:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:27] <wikibugs>	 (03PS2) 10Muehlenhoff: sre.ganeti.addnode: Also check for the analytics bridge in eqiad [cookbooks] - 10https://gerrit.wikimedia.org/r/755442
[11:14:35] <wikibugs>	 (03CR) 10Muehlenhoff: sre.ganeti.addnode: Also check for the analytics bridge in eqiad (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/755442 (owner: 10Muehlenhoff)
[11:15:22] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/755442 (owner: 10Muehlenhoff)
[11:16:39] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[11:16:51] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: sync on staging
[11:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:02] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply on production
[11:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:05] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply on staging
[11:18:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:12] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided)
[11:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:21] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s)
[11:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:36] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: sync on production
[11:19:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33358/console" [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[11:20:24] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Return a set, not a list, from active_images() [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748873 (owner: 10RLazarus)
[11:21:53] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided)
[11:21:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:56] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 03s)
[11:21:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:02] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided)
[11:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:11] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s)
[11:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:08] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply on production
[11:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:10] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply on staging
[11:23:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:18] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: sync on production
[11:24:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:38] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/723607 (owner: 10PipelineBot)
[11:25:49] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/723658 (owner: 10PipelineBot)
[11:28:09] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[11:28:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T285149)', diff saved to https://phabricator.wikimedia.org/P18938 and previous config saved to /var/cache/conftool/dbconfig/20220120-112846-marostegui.json
[11:28:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[11:28:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[11:28:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:50] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[11:28:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:52] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T285149)', diff saved to https://phabricator.wikimedia.org/P18939 and previous config saved to /var/cache/conftool/dbconfig/20220120-112854-marostegui.json
[11:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:56] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[11:28:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:38] <wikibugs>	 (03PS3) 10Jbond: O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650
[11:30:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T285149)', diff saved to https://phabricator.wikimedia.org/P18940 and previous config saved to /var/cache/conftool/dbconfig/20220120-113006-marostegui.json
[11:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33359/console" [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond)
[11:30:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet
[11:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond)
[11:30:48] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[11:30:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:21] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[11:33:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Also check for the analytics bridge in eqiad [cookbooks] - 10https://gerrit.wikimedia.org/r/755442 (owner: 10Muehlenhoff)
[11:35:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet
[11:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:32] <wikibugs>	 (03PS5) 10Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743356
[11:38:29] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Stop writing parquet logs to files [puppet] - 10https://gerrit.wikimedia.org/r/755655 (https://phabricator.wikimedia.org/T297734) (owner: 10Btullis)
[11:39:16] <wikibugs>	 (03PS4) 10Jbond: O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650
[11:39:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33360/console" [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond)
[11:41:19] <icinga-wm>	 PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:43:19] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[11:45:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18941 and previous config saved to /var/cache/conftool/dbconfig/20220120-114510-marostegui.json
[11:45:13] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Rename main cluster to wikikube (1/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris)
[11:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] O:pki::multirootca: update config to inject default profile options [puppet] - 10https://gerrit.wikimedia.org/r/755650 (owner: 10Jbond)
[11:49:51] <moritzm>	 !log add ganeti1024 to Ganeti eqiad cluster T283036
[11:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:54] <stashbot>	 T283036: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036
[11:49:55] <wikibugs>	 (03PS3) 10Jbond: Do NOT MERGE "role::pki::multirootca: add expiry for k8s_mlserve" [puppet] - 10https://gerrit.wikimedia.org/r/755408
[11:51:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33361/console" [puppet] - 10https://gerrit.wikimedia.org/r/755408 (owner: 10Jbond)
[11:54:12] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[11:55:46] <mvolz>	 I deployed an update that I think broke metrics: https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=30s&from=now-30m&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=citoid
[11:55:56] <mvolz>	 was supposed to be backwards compatible
[11:56:06] <mvolz>	 do I revert for the time being? 
[11:57:00] <mvolz>	 jelto: what do you think? 
[11:57:54] <wikibugs>	 (03PS1) 10Mvolz: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/755666
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, and apergos: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1200).
[12:00:04] <jouncebot>	 noa_wmde: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18942 and previous config saved to /var/cache/conftool/dbconfig/20220120-120015-marostegui.json
[12:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:29] <Lucas_WMDE>	 o/
[12:00:37] <Lucas_WMDE>	 noa is having IRC troubles but will hopefully join soon
[12:00:41] <Lucas_WMDE>	 (and I can deploy)
[12:01:03] <Lucas_WMDE>	 mvolz: for now I’m not deploying yet and you’re good to go if you need to roll something back
[12:01:30] <mvolz>	 Lucas_WMDE: I think I will, thanks :)
[12:01:47] <wikibugs>	 (03CR) 10Mvolz: [C: 03+2] Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/755666 (owner: 10Mvolz)
[12:04:11] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging
[12:04:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:14] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production
[12:04:15] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on staging
[12:04:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:18] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[12:04:24] <mvolz>	 I *did* think it might break metrics and even checked after but it took longer than I thought to show up and then moved on. 🙄 sorry for overlapping
[12:05:18] <Lucas_WMDE>	 (looks like nobody signed up for training today btw)
[12:05:28] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging
[12:05:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:30] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production
[12:05:31] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on staging
[12:05:31] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/755666 (owner: 10Mvolz)
[12:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:55] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging
[12:05:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:57] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production
[12:05:58] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on staging
[12:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:06] <apergos>	 excuse me the previous meeting ran over
[12:06:20] <apergos>	 there is one patch for the window that is a config patch, and no trainees scheduled
[12:06:26] <apergos>	 the one patch looked straightforward to me
[12:06:36] <Lucas_WMDE>	 yup, I’ll deploy it once noa joins
[12:06:40] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply on staging
[12:06:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:42] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply on production
[12:06:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:11] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: sync on staging
[12:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:16] <apergos>	 (the previous meeting is actually still going, I am trying to participate in a complicated db config discussion while being here, heh)
[12:08:08] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply on production
[12:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:11] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply on staging
[12:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:01] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: sync on production
[12:09:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:55] <apergos>	 hey Noa_WMDE
[12:10:00] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply on production
[12:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:02] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply on staging
[12:10:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:05] <apergos>	 are you here for your config patch?
[12:10:30] <Noa_WMDE>	 Hi apergos, yes
[12:10:43] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: sync on production
[12:10:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:23] <Lucas_WMDE>	 mvolz: was that the last sync?
[12:11:24] <mvolz>	 I am hopefully done now 
[12:11:28] <Lucas_WMDE>	 ok
[12:11:30] <apergos>	 You're the only one in the window, I believe Lucas_WMDE is doing actual deploys, if you don't have the rights
[12:11:59] <Noa_WMDE>	 yep, that's the plan. thanks!
[12:12:30] <Lucas_WMDE>	 hm, my `logspam-watch` is being slow to start it seems
[12:12:36] * Lucas_WMDE tries in another SSH connection
[12:13:58] <Lucas_WMDE>	 ok now it loaded
[12:14:13] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Enable usage tracking for statements in Waray Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755322 (https://phabricator.wikimedia.org/T296383) (owner: 10Noa wmde)
[12:14:14] <apergos>	 \o/
[12:15:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T285149)', diff saved to https://phabricator.wikimedia.org/P18943 and previous config saved to /var/cache/conftool/dbconfig/20220120-121520-marostegui.json
[12:15:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:24] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[12:15:51] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable usage tracking for statements in Waray Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755322 (https://phabricator.wikimedia.org/T296383) (owner: 10Noa wmde)
[12:16:24] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[12:16:54] <wikibugs>	 (03Merged) 10jenkins-bot: Enable usage tracking for statements in Waray Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755322 (https://phabricator.wikimedia.org/T296383) (owner: 10Noa wmde)
[12:17:31] <Lucas_WMDE>	 Noa_WMDE: alright, the change is on mwdebug1001 now
[12:17:33] <Lucas_WMDE>	 do you know how to test it?
[12:18:12] <Noa_WMDE>	 not more than keeping an eye on the dashboard no
[12:18:28] <Lucas_WMDE>	 have you used the WikimediaDebug extension before?
[12:18:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:51] <Noa_WMDE>	 Is that the one where you can change servers?
[12:18:54] <Lucas_WMDE>	 yeah
[12:19:10] <Noa_WMDE>	 I think so but I need to find out if it's installed
[12:19:18] <Lucas_WMDE>	 and I think it should be possible to test this change by purging a warwiki page on mwdebug1001 and then looking at action=info to see which entity usage it now has
[12:19:23] <Lucas_WMDE>	 we just need to find a page that uses Wikidata statements
[12:19:59] <Noa_WMDE>	 okay it's installed
[12:20:48] <Noa_WMDE>	 can I add a page and purge it directly?
[12:21:06] <Lucas_WMDE>	 well, it needs to be a page that uses Wikidata
[12:21:38] <Lucas_WMDE>	 looks like https://war.wikipedia.org/wiki/Sangkalibutan has an “other” usage on Q1
[12:21:44] <Lucas_WMDE>	 so we can try that one
[12:22:05] <Lucas_WMDE>	 add ?action=purge with the extension enabled and set to mwdebug1001
[12:22:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:22:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:23:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:54] <Noa_WMDE>	 I completely purge (it's on mwdebug1001)
[12:24:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: drop k8s-specific default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755692
[12:24:13] <Lucas_WMDE>	 hm, so far https://war.wikipedia.org/w/index.php?title=Sangkalibutan&action=info still looks like an “other” usage
[12:24:13] <Noa_WMDE>	 okay, cache purged.
[12:24:15] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: refresh header comment [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755693
[12:24:17] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: migrate create_instance_with_prefix to CommonOpts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755694
[12:24:47] <Lucas_WMDE>	 let’s try https://war.wikipedia.org/wiki/Orl%C3%A9ans ? (one of few pages linking to a Module:Wd, apparently)
[12:25:14] <Noa_WMDE>	 ok
[12:26:24] <Noa_WMDE>	 purged
[12:26:36] <Noa_WMDE>	 where in the info can you see the usage type?
[12:26:45] <Lucas_WMDE>	 under “Wikidata entities used in this page”
[12:26:55] <Lucas_WMDE>	 and so far it still looks like “other” usage :/
[12:27:58] <Lucas_WMDE>	 let’s use a sandbox page so we know it uses statements https://war.wikipedia.org/wiki/Gumaramit:Lucas_Werkmeister_(WMDE)/sandbox
[12:28:18] <Lucas_WMDE>	 yay, there’s a statement usage in https://war.wikipedia.org/w/index.php?title=Gumaramit:Lucas_Werkmeister_(WMDE)/sandbox&action=info
[12:28:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:28:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:16] <Lucas_WMDE>	 I think that’s good enough to deploy
[12:29:17] <Noa_WMDE>	 purged
[12:29:28] <Noa_WMDE>	 yeah I saw a statement
[12:30:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:30:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:05] <Noa_WMDE>	 I guess it's just a very specific case to find live examples for
[12:30:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:18] <Lucas_WMDE>	 yeah
[12:30:24] <Lucas_WMDE>	 syncing
[12:30:53] <Lucas_WMDE>	 there are three deprecations at the top of logspam-watch btw, one of them with 36k occurrences in the past hour
[12:30:58] <Lucas_WMDE>	 I assume someone™ is taking care of those
[12:31:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755322|Enable usage tracking for statements in Waray Wikipedia (T296383)]] (expecting some gradual increase of wbc_entity_usage rows on warwiki) (duration: 00m 51s)
[12:31:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:31:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:17] <stashbot>	 T296383: Enable statement usage tracking on warwiki - https://phabricator.wikimedia.org/T296383
[12:31:18] <Lucas_WMDE>	 and not just adding hard deprecations to prod and then leaving them there
[12:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:05] <Lucas_WMDE>	 purging my sandbox without mwdebug now
[12:32:14] <Lucas_WMDE>	 entity usage still has statements, yay
[12:33:32] <Noa_WMDE>	 \o/
[12:35:48] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Replace remaining usages of IDatabase::fetchObject() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755667 (https://phabricator.wikimedia.org/T299471)
[12:35:58] <Lucas_WMDE>	 ^ let’s just backport this now
[12:36:21] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Replace remaining usages of IDatabase::fetchObject() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755667 (https://phabricator.wikimedia.org/T299471) (owner: 10Lucas Werkmeister (WMDE))
[12:39:14] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: refresh comments [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755693
[12:39:16] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: wmcs: vps: migrate create_instance_with_prefix to CommonOpts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755694
[12:39:49] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Fix deprecation warning from LinksUpdate::getImages() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755668 (https://phabricator.wikimedia.org/T299472)
[12:40:05] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix deprecation warning from LinksUpdate::getImages() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755668 (https://phabricator.wikimedia.org/T299472) (owner: 10Lucas Werkmeister (WMDE))
[12:40:35] <Lucas_WMDE>	 ^and this one too
[12:40:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: probe with http host override [puppet] - 10https://gerrit.wikimedia.org/r/755657 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[12:40:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: add probes for non-lvs services [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[12:41:02] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: add probes for non-lvs services [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946)
[12:41:47] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: vps: create_instance_with_prefix: drop k8s-specific default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755692 (owner: 10Arturo Borrero Gonzalez)
[12:46:38] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: vps: create_instance_with_prefix: refresh comments [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755693 (owner: 10Arturo Borrero Gonzalez)
[12:46:44] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: vps: migrate create_instance_with_prefix to CommonOpts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755694 (owner: 10Arturo Borrero Gonzalez)
[12:50:29] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops, 10Documentation: Documentation updates in decom workflow - https://phabricator.wikimedia.org/T287388 (10Aklapper)
[12:51:39] <wikibugs>	 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 9 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Aklapper) Half a year later, does someone plan to pick up https://gerrit.wikimedia.org/r/c/710138 , or what is left to do in this open high prio t...
[12:57:41] <wikibugs>	 (03Merged) 10jenkins-bot: Replace remaining usages of IDatabase::fetchObject() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755667 (https://phabricator.wikimedia.org/T299471) (owner: 10Lucas Werkmeister (WMDE))
[12:57:45] <Lucas_WMDE>	 yay
[12:59:09] <wikibugs>	 (03Merged) 10jenkins-bot: Fix deprecation warning from LinksUpdate::getImages() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755668 (https://phabricator.wikimedia.org/T299472) (owner: 10Lucas Werkmeister (WMDE))
[13:00:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.18/includes/: Backport: [[gerrit:755667|Replace remaining usages of IDatabase::fetchObject() (T299471)]] (1/2) (duration: 00m 56s)
[13:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:05] <stashbot>	 T299471: PHP Deprecated: Use of Wikimedia\Rdbms\DBConnRef::fetchObject was deprecated in MediaWiki 1.37. [Called from SpecialRandomPage::selectRandomPageFromDB] - https://phabricator.wikimedia.org/T299471
[13:01:07] <Lucas_WMDE>	 I’ll slightly overrun the window to finish these backports
[13:01:09] <Lucas_WMDE>	 jouncebot: now
[13:01:09] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 58 minute(s)
[13:01:13] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.18/maintenance/: Backport: [[gerrit:755667|Replace remaining usages of IDatabase::fetchObject() (T299471)]] (2/2) (duration: 00m 50s)
[13:01:13] <Lucas_WMDE>	 nothing else going on at least
[13:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[13:01:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[13:02:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[13:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.18/includes/deferred/LinksUpdate/LinksUpdate.php: Backport: [[gerrit:755668|Fix deprecation warning from LinksUpdate::getImages() (T299472)]] (duration: 00m 50s)
[13:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:01] <stashbot>	 T299472: PHP Deprecated: Use of MediaWiki\Deferred\LinksUpdate\LinksUpdate::$mImages was deprecated in MediaWiki 1.38. [Called from MediaWiki\Extension\GlobalUsage\Hooks::onLinksUpdateComplete] - https://phabricator.wikimedia.org/T299472
[13:03:06] <Lucas_WMDE>	 !log UTC morning backport window done
[13:03:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:22] <Lucas_WMDE>	 (I’ll watch the error log for a few more minutes, the deprecation volume should go down dramatically)
[13:03:41] <wikibugs>	 (03PS1) 10JMeybohm: Fix nrpe_check_disk_options hiera key for kubernetes staging masters [puppet] - 10https://gerrit.wikimedia.org/r/755698 (https://phabricator.wikimedia.org/T290967)
[13:03:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[13:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, 10Patch-For-Review: add traceroute measurements to RIPE Atlas prometheus data - https://phabricator.wikimedia.org/T251156 (10ayounsi) 05Open→03Resolved a:05CDanis→03ayounsi This is done, opened T299640 for further improvements.
[13:04:25] <wikibugs>	 10SRE, 10observability: Add RIPE atlas data to Prometheus - https://phabricator.wikimedia.org/T167689 (10ayounsi)
[13:04:41] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: match SNI with Host when overridden [puppet] - 10https://gerrit.wikimedia.org/r/755699 (https://phabricator.wikimedia.org/T291946)
[13:06:15] <Lucas_WMDE>	 https://i.imgur.com/b9iPByi.png
[13:06:35] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Fix nrpe_check_disk_options hiera key for kubernetes staging masters [puppet] - 10https://gerrit.wikimedia.org/r/755698 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[13:06:35] <Lucas_WMDE>	 much better
[13:07:17] <wikibugs>	 (03PS1) 10Elukey: Remove duplicate hiera config for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/755702
[13:07:43] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Replace remaining usages of IDatabase::fetchObject()/::numRows() [extensions/CentralNotice] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755670 (https://phabricator.wikimedia.org/T286694)
[13:07:48] <Lucas_WMDE>	 I’ll just cherry pick the last big one too
[13:08:11] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Replace remaining usages of IDatabase::fetchObject()/::numRows() [extensions/CentralNotice] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755670 (https://phabricator.wikimedia.org/T286694) (owner: 10Lucas Werkmeister (WMDE))
[13:08:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[13:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:56] <icinga-wm>	 RECOVERY - Disk space on kubestagemaster2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubestagemaster2001&var-datasource=codfw+prometheus/ops
[13:09:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[13:09:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[13:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[13:10:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33362/console" [puppet] - 10https://gerrit.wikimedia.org/r/755699 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[13:11:17] <wikibugs>	 (03Merged) 10jenkins-bot: Replace remaining usages of IDatabase::fetchObject()/::numRows() [extensions/CentralNotice] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755670 (https://phabricator.wikimedia.org/T286694) (owner: 10Lucas Werkmeister (WMDE))
[13:11:26] <wikibugs>	 (03PS2) 10Muehlenhoff: Update to 6.4.5 and enable webauthn [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644
[13:11:31] <wikibugs>	 (03CR) 10Muehlenhoff: Update to 6.4.5 and enable webauthn (033 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644 (owner: 10Muehlenhoff)
[13:13:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/CentralNotice/includes/: Backport: [[gerrit:755670|Replace remaining usages of IDatabase::fetchObject()/::numRows() (T286694)]] (duration: 00m 50s)
[13:13:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:11] <stashbot>	 T286694: Drop legacy cruft arising from introduction of ResultWrapper - https://phabricator.wikimedia.org/T286694
[13:13:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: match SNI with Host when overridden [puppet] - 10https://gerrit.wikimedia.org/r/755699 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[13:14:11] <wikibugs>	 10SRE, 10Platform Engineering, 10Wikimedia-Mailing-lists: Close / shut down public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Aklapper)
[13:15:26] <wikibugs>	 10SRE, 10Kubernetes, 10discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (10Aklapper) @Joe: Do you know, by any chance? (Or have some link handy?)
[13:15:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti1025.eqiad.wmnet with reason: Change KVM setting in BIOS
[13:15:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti1025.eqiad.wmnet with reason: Change KVM setting in BIOS
[13:15:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet
[13:15:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[13:16:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:08] <Lucas_WMDE>	 even better now https://i.imgur.com/wwHOA7K.png
[13:17:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[13:17:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[13:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:52] <Lucas_WMDE>	 not backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Renameuser/+/755507 since that looks like it would be a very rare warning
[13:18:04] <Lucas_WMDE>	 and also the change doesn’t look as trivial as the others
[13:18:09] <Lucas_WMDE>	 so I’ll just let that roll out with the train
[13:18:24] <Lucas_WMDE>	 pretty sure I’m actually done now
[13:18:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[13:18:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:53] <zabe>	 the problem with the cn one is that it is going to reappear next week unless it gets merged into the wmf_deploy branch
[13:19:33] <Lucas_WMDE>	 I don’t follow
[13:19:37] <Lucas_WMDE>	 what wmf_deploy branch?
[13:21:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet
[13:21:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:32] <Lucas_WMDE>	 it’s cherry-picked on wmf.18, and merged on master before the wmf.19 branch cut, shouldn’t that be enough?
[13:21:58] <Lucas_WMDE>	 o_O https://wikitech.wikimedia.org/wiki/CentralNotice#Deployment
[13:22:22] <zabe>	 CentralNotice has a 'special' practice: they have a wmf_deploy branch. The wmf branches are cut from that branch. So everything needs to be cherry-picked from master to that branch first...
[13:22:29] <Lucas_WMDE>	 O_o
[13:22:50] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[13:24:01] <Lucas_WMDE>	 so, are we supposed to cherry-pick the patch to wmf_deploy?
[13:24:04] <Lucas_WMDE>	 or merge master into wmf_deploy?
[13:24:08] <Lucas_WMDE>	 or is someone else responsible for that?
[13:25:37] <Lucas_WMDE>	 well, I uploaded https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/755674
[13:26:05] <zabe>	 I think that's fundraising tech area? tbh I don't know what we are supposed to do.
[13:27:02] <Lucas_WMDE>	 CCed the person who uploaded most of the other recent wmf_deploy changes on Gerrit 🤷
[13:27:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[13:27:30] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[13:30:18] <wikibugs>	 (03PS1) 10DCausse: ejoseph: update ssh key [puppet] - 10https://gerrit.wikimedia.org/r/755706
[13:36:20] <wikibugs>	 (03CR) 10DCausse: "@EJoseph can you confirm that this is the SSH key you'll be using for production access?" [puppet] - 10https://gerrit.wikimedia.org/r/755706 (owner: 10DCausse)
[13:37:06] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on db1100 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:37:13] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[13:37:24] <icinga-wm>	 PROBLEM - Check systemd state on db1100 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:40:25] <marostegui>	 Amir1: ^ that's the host that got rebooted yesterday?
[13:40:40] <Amir1>	 yes
[13:40:44] <Amir1>	 let me depool it
[13:40:45] <wikibugs>	 (03CR) 10EJoseph: ejoseph: update ssh key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755706 (owner: 10DCausse)
[13:40:46] <marostegui>	 Amir1: so puppet is disabled there
[13:41:00] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting LDAP-only access to analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10Ottomata) Approved.
[13:41:02] <marostegui>	 can I enable it?
[13:41:03] <Amir1>	 it'
[13:41:10] <Amir1>	 let me first depool it
[13:41:14] <marostegui>	 no need
[13:42:00] <Amir1>	 oh okay
[13:42:04] <Amir1>	 let me abort my change
[13:42:23] <marostegui>	 so the script didn't enable puppet at the end?
[13:42:36] <marostegui>	 the recovery should arrive soon, just ran puppet
[13:43:03] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] "I confirmed the key with Emmanuel" [puppet] - 10https://gerrit.wikimedia.org/r/755706 (owner: 10DCausse)
[13:43:04] <Amir1>	 I did that manually because it just didn't get back up
[13:43:19] <Amir1>	 but I forgot to reenable puppet
[13:43:28] <marostegui>	 ah cool
[13:43:30] <marostegui>	 no problem
[13:43:45] <Amir1>	 the cookbook doesn't have "let's pick it from here" AFAIK :(
[13:45:08] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on db1100 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:45:08] <icinga-wm>	 RECOVERY - Check systemd state on db1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:34] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: support creating more than 1 instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755707
[13:50:51] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/755489 (https://phabricator.wikimedia.org/T297683) (owner: 10Andrew Bogott)
[13:51:00] <moritzm>	 !log enabled hardware virtualisation in BIOS for ganeti1025 T293909
[13:51:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:05] <stashbot>	 T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909
[13:52:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti1024.eqiad.wmnet with reason: Change hw virt setting in BIOS
[13:52:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti1024.eqiad.wmnet with reason: Change hw virt setting in BIOS
[13:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet
[13:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:26] <wikibugs>	 (03CR) 10Jbond: role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[13:55:47] <marostegui>	 !log Power off es1022 for onsite maintenance T299123
[13:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:51] <stashbot>	 T299123: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123
[13:56:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10MoritzMuehlenhoff)
[13:57:50] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10MoritzMuehlenhoff)
[13:58:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Marostegui) @Cmjohnson is now off. You can proceed as needed.
[14:00:13] <wikibugs>	 (03PS4) 10Jbond: P:pki::multirootca: Only override differences [puppet] - 10https://gerrit.wikimedia.org/r/755408
[14:00:52] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33363/console" [puppet] - 10https://gerrit.wikimedia.org/r/755408 (owner: 10Jbond)
[14:03:36] <moritzm>	 !log enabled hardware virtualisation in BIOS for ganeti1024 T283036
[14:03:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:41] <stashbot>	 T283036: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036
[14:05:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet
[14:05:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:02] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM for now, let's revisit once we have the group in netbox." [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 (owner: 10Muehlenhoff)
[14:06:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet
[14:06:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:20] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10Volans) Although not optimal, in the worse case scenario in which we will be unable to find/modify a tool to preserve empty lines, we could also co...
[14:10:37] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10WDoranWMF) @Dzahn Is it possible to add @MNadrofsky to the approver lists as he is the Platform Tech Director?
[14:13:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 (owner: 10Muehlenhoff)
[14:13:09] <wikibugs>	 (03CR) 10Hnowlan: admin: add Desiree Abad as approver for platform-engineering groups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) (owner: 10Dzahn)
[14:16:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet
[14:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:09] <marostegui>	 elukey: cumin
[14:17:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Add prometheus[12]00[56] to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/755708 (https://phabricator.wikimedia.org/T296199)
[14:20:53] <moritzm>	 !log enabled hardware virtualisation in BIOS for ganeti1023 T283036
[14:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:57] <stashbot>	 T283036: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036
[14:21:24] <wikibugs>	 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) >>! In T292322#7635623, @Legoktm wrote: > Using a known broken hash like MD5 seems wrong in what's supposed to be a security-sensitive application. S...
[14:21:28] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] analytics:refinery:job:data_purge: Add deletion for anomaly detection [puppet] - 10https://gerrit.wikimedia.org/r/753052 (https://phabricator.wikimedia.org/T298972) (owner: 10Mforns)
[14:25:09] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster
[14:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:48] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster
[14:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:20] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster
[14:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm thx" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644 (owner: 10Muehlenhoff)
[14:35:16] <elukey>	 marostegui: cumin cumin
[14:35:45] <marostegui>	 \o/
[14:36:38] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/755408 (owner: 10Jbond)
[14:43:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet
[14:43:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: add host-specific Prometheus data [puppet] - 10https://gerrit.wikimedia.org/r/755711 (https://phabricator.wikimedia.org/T296199)
[14:52:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet
[14:52:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:40] <wikibugs>	 10SRE, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) 23:35 mutante: puppetmaster1001 - revoked puppet cert miscweb.discovery.wmnet; updated kube_services.crts.yaml to include static-bugzilla.wikimedia.org, removed miscweb....
[14:54:30] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update to 6.4.5 and enable webauthn [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/755644 (owner: 10Muehlenhoff)
[14:55:40] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided)
[14:55:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:52] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 11s)
[14:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:16] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) @Muehlenhoff Could you comment on that? Should the structure be that it has team EMs rather than directors? And if there are multiple approv...
[14:56:34] <moritzm>	 !log enabled hardware virtualisation in BIOS for ganeti1026 T293909
[14:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:38] <stashbot>	 T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909
[14:57:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet
[14:57:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:54] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster
[14:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:01] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: move to a single flag to control uploads [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199)
[14:58:15] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster
[14:58:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:16] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10MoritzMuehlenhoff) >>! In T298124#7636613, @Dzahn wrote: > @Muehlenhoff Could you comment on that? Should the structure be that it has team EMs rat...
[15:02:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33364/console" [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi)
[15:04:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet
[15:04:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:23] <moritzm>	 !log enabled hardware virtualisation in BIOS for ganeti1027 T293909
[15:05:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:26] <stashbot>	 T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909
[15:05:31] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster
[15:05:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet
[15:05:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:40] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster
[15:05:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:02] <wikibugs>	 (03PS2) 10Filippo Giunchedi: thanos: move to a single flag to control uploads [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199)
[15:08:31] <wikibugs>	 (03PS1) 10Hashar: ci: set Docker partition size explicitly [puppet] - 10https://gerrit.wikimedia.org/r/755713
[15:11:36] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:11:43] <logmsgbot>	 !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply on main
[15:11:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33365/console" [puppet] - 10https://gerrit.wikimedia.org/r/755658 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[15:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:44] <moritzm>	 !log enabled hardware virtualisation in BIOS for ganeti1028 T293909
[15:12:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33366/console" [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi)
[15:12:48] <stashbot>	 T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909
[15:13:57] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster
[15:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:24] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster2003 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:14:35] <logmsgbot>	 !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: sync on main
[15:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet
[15:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "This change will enable uploads for the 'ext' instance, which I think is fine" [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi)
[15:16:01] <logmsgbot>	 !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply on main
[15:16:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[15:17:12] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:17:39] <mutante>	 ^ would not touch that one, it looks category: risk-very-high (multi root CA change :)
[15:20:13] <logmsgbot>	 !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: sync on main
[15:20:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:05] <wikibugs>	 (03PS2) 10Hashar: ci: set Docker partition size explicitly [puppet] - 10https://gerrit.wikimedia.org/r/755713
[15:22:23] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster
[15:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:42] <wikibugs>	 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (10Joe)
[15:23:08] <jbond>	 sorry merged changes now
[15:23:34] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[15:23:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix test for virtualisation [cookbooks] - 10https://gerrit.wikimedia.org/r/755714
[15:23:56] <wikibugs>	 (03CR) 10Dzahn: "looks good, just nitpick that the link in the commit message to show where it was changed links back to itself" [puppet] - 10https://gerrit.wikimedia.org/r/755329 (owner: 10Hashar)
[15:24:00] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:24:13] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:25:18] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster2003 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:25:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed in upstream docs default is false" [puppet] - 10https://gerrit.wikimedia.org/r/755328 (owner: 10Hashar)
[15:27:27] <mutante>	 jbond: no problem at all, it seemed obvious that type of change might need some extra care at merge
[15:28:56] <wikibugs>	 (03Abandoned) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko)
[15:31:28] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster
[15:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:37] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster
[15:31:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:14] <wikibugs>	 (03PS1) 10Elukey: Set inference.discovery.wmnet to production stage [puppet] - 10https://gerrit.wikimedia.org/r/755715 (https://phabricator.wikimedia.org/T289835)
[15:37:53] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan)
[15:42:21] <wikibugs>	 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan)
[15:42:45] <wikibugs>	 (03CR) 10Elukey: api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan)
[15:43:30] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2017.codfw.wmnet with OS buster
[15:43:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] remove elk5 related LVS services [puppet] - 10https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron)
[15:45:48] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[15:46:02] <icinga-wm>	 RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:46:51] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided)
[15:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:00] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s)
[15:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting LDAP-only access to analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10Jelto)
[15:57:34] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2018.codfw.wmnet with OS buster
[15:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:37] <wikibugs>	 (03PS2) 10Elukey: Set inference.discovery.wmnet to production stage [puppet] - 10https://gerrit.wikimedia.org/r/755715 (https://phabricator.wikimedia.org/T289835)
[16:00:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Set inference.discovery.wmnet to production stage [puppet] - 10https://gerrit.wikimedia.org/r/755715 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[16:01:05] <Bsadowski1>	 "(Cannot access the database: Cannot access the database: Unknown database 'metawiki' (db1169) (db1169)"
[16:01:08] <Bsadowski1>	 ???
[16:01:36] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan)
[16:03:44] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan)
[16:08:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti1025 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755723
[16:11:22] <wikibugs>	 (03CR) 10Jbond: role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[16:11:57] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10WDoranWMF) On that basis it makes most sense to add me and Atieno(Atieno is a new EM on Platform she is setting up her phab/gerrit at the moment)....
[16:12:50] <icinga-wm>	 PROBLEM - Host ganeti1018 is DOWN: PING CRITICAL - Packet loss = 100%
[16:13:53] <jynus>	 Bsadowski1: not sure if you have already been told, but someone is already looking at that, seems like a recent issue
[16:15:17] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[16:16:00] <wikibugs>	 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) yes, mgmt works via ssh but the new version doesn't allow me to access the web interface. I use that interface to do most firmware update...
[16:16:10] <wikibugs>	 (03CR) 10Jbond: role::pki::root: add the ml_serve intermediate PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[16:19:59] <icinga-wm>	 RECOVERY - Host ganeti1018 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[16:20:23] <wikibugs>	 (03PS2) 10Elukey: Add dns discovery settings for inference.svc.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/730541 (https://phabricator.wikimedia.org/T289835)
[16:22:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Cmjohnson) @Marostegui BIOS and network Firmware updated, this should fix your issue. I will leave task open until you confirm all is well.
[16:23:09] <wikibugs>	 (03PS3) 10Elukey: Add dns discovery settings for inference.svc.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/730541 (https://phabricator.wikimedia.org/T289835)
[16:23:24] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) @Muehlenhoff Thank you! makes sense  @WDoranWMF Ok, and yes, actually that would be ideal if you make a new request, thank you. Since it's n...
[16:25:04] <AndyRussG>	 Lucas_WMDE: hi! I just replied on the Gerrit change... thanks for working on that.. is deployment urgent at all, or can it wait until next week's train?
[16:25:16] <AndyRussG>	 (^ wrt CentralNotice deploy stuff)
[16:26:07] <Lucas_WMDE>	 AndyRussG: I just replied :)
[16:26:10] <Lucas_WMDE>	 nothing urgent I think
[16:26:11] <wikibugs>	 (03PS1) 10Ladsgroup: DatabaseBlock: Pass database name to getConnectionRef [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755676
[16:26:21] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] DatabaseBlock: Pass database name to getConnectionRef [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755676 (owner: 10Ladsgroup)
[16:26:27] <icinga-wm>	 PROBLEM - Host ganeti1018 is DOWN: PING CRITICAL - Packet loss = 100%
[16:27:01] <Amir1>	 jouncebot: nowandnext
[16:27:01] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 32 minute(s)
[16:27:01] <jouncebot>	 In 0 hour(s) and 32 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1700)
[16:27:58] <AndyRussG>	 Lucas_WMDE: ah fantastic thanks so much!! :)
[16:28:05] <Lucas_WMDE>	 feel free to abandon the patch if it’s not needed :)
[16:28:41] <AndyRussG>	 Lucas_WMDE: ok thanks.. yeah I'll do a general merge of master to wmf_deploy before the next branch cut, then, apologies again for the twisted process ;p
[16:28:46] <Lucas_WMDE>	 alright :)
[16:28:53] <icinga-wm>	 PROBLEM - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is CRITICAL: TCP CRITICAL - Invalid hostname, address or socket: inference.discovery.wmnet https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:30:01] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: support creating more than 1 instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755707
[16:30:06] <wikibugs>	 (03CR) 10Volans: wmcs: move grid-dedicated code to its own package (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753769 (owner: 10Arturo Borrero Gonzalez)
[16:31:19] <wikibugs>	 (03PS3) 10Hashar: ci: set Docker partition size explicitly [puppet] - 10https://gerrit.wikimedia.org/r/755713
[16:31:29] <icinga-wm>	 PROBLEM - LVS inference eqiad port 30443/tcp - Inference ML service IPv4 on inference.svc.eqiad.wmnet is CRITICAL: TCP CRITICAL - Invalid hostname, address or socket: inference.discovery.wmnet https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:35:11] <elukey>	 this is me --^
[16:35:18] <mutante>	 ACK
[16:35:32] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2018.codfw.wmnet with OS buster
[16:35:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:11] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2018.codfw.wmnet
[16:36:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:19] <icinga-wm>	 RECOVERY - Host ganeti1018 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[16:37:48] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/755714 (owner: 10Muehlenhoff)
[16:38:05] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Fix test for virtualisation [cookbooks] - 10https://gerrit.wikimedia.org/r/755714 (owner: 10Muehlenhoff)
[16:38:20] <wikibugs>	 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson)
[16:38:28] <Pchelolo>	 jouncebot: nowandnext
[16:38:28] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 21 minute(s)
[16:38:28] <jouncebot>	 In 0 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1700)
[16:39:09] <wikibugs>	 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) @MoritzMuehlenhoff The idrac is giving me a hard time, it's not worth slowing this process down. The idrac has no bearing on your issue....
[16:39:11] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:40:08] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2019.codfw.wmnet with OS buster
[16:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:11] <icinga-wm>	 ACKNOWLEDGEMENT - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is CRITICAL: TCP CRITICAL - Invalid hostname, address or socket: inference.discovery.wmnet daniel_zahn known, will be fixed soon https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:41:11] <icinga-wm>	 ACKNOWLEDGEMENT - LVS inference eqiad port 30443/tcp - Inference ML service IPv4 on inference.svc.eqiad.wmnet is CRITICAL: TCP CRITICAL - Invalid hostname, address or socket: inference.discovery.wmnet daniel_zahn known, will be fixed soon https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:43:27] <Pchelolo>	 I'll deploy a little mw-config change if nobody minds
[16:43:33] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+2] Add temporary entrypoint for settings benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755399 (owner: 10Ppchelko)
[16:43:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DatabaseBlock: Pass database name to getConnectionRef [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755676 (owner: 10Ladsgroup)
[16:44:23] <wikibugs>	 (03Merged) 10jenkins-bot: Add temporary entrypoint for settings benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755399 (owner: 10Ppchelko)
[16:45:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:45:49] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[16:46:59] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "Make Block objects aware of which wiki they belong to" [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755678
[16:47:27] <wikibugs>	 (03Abandoned) 10Ladsgroup: DatabaseBlock: Pass database name to getConnectionRef [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755676 (owner: 10Ladsgroup)
[16:47:35] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Make Block objects aware of which wiki they belong to" [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755678 (owner: 10Ladsgroup)
[16:47:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:47:59] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[16:48:01] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2019.codfw.wmnet with OS buster
[16:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:25] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2019.codfw.wmnet with OS buster
[16:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:29] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) Also, let's incorporate @hnowlan's comments on https://gerrit.wikimedia.org/r/c/operations/puppet/+/755500/1/modules/admin/data/data.yaml in...
[16:50:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[16:50:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:46] <logmsgbot>	 !log ppchelko@deploy1002 Synchronized w/tmp_settings_bench.php: Config: gerrit 755399 add temporary entrypoint for settings benchmark (duration: 00m 50s)
[16:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[16:51:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[16:51:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:53] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan)
[16:52:01] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] admin: add Desiree Abad as approver for platform-engineering groups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) (owner: 10Dzahn)
[16:52:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Cmjohnson) a:05Cmjohnson→03Marostegui
[16:52:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[16:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:41] <wikibugs>	 (03Abandoned) 10Dzahn: admin: add Desiree Abad as approver for platform-engineering groups [puppet] - 10https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) (owner: 10Dzahn)
[16:55:13] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2019.codfw.wmnet with OS buster
[16:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:50] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2020.codfw.wmnet with OS buster
[16:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:04] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) 05Open→03Resolved per above (we can link the new ticket here once it's created)
[17:00:05] <jouncebot>	 jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:30] <rzl>	 ✅
[17:00:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add dns discovery settings for inference.svc.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/730541 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[17:01:19] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:01:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install labstore100[89] - https://phabricator.wikimedia.org/T299610 (10Andrew) clouddumps100x or  clouddatasets100x or just datasets100x
[17:01:55] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan)
[17:01:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Prod-Kubernetes, and 3 others: decommission kubestage100[12]-eqiad - https://phabricator.wikimedia.org/T299142 (10Cmjohnson)
[17:03:09] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2020.codfw.wmnet with OS buster
[17:03:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:40] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=inference
[17:04:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Prod-Kubernetes, and 3 others: decommission kubestage100[12]-eqiad - https://phabricator.wikimedia.org/T299142 (10Cmjohnson) 05Open→03Resolved
[17:05:32] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:05:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:51] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2021.codfw.wmnet with OS buster
[17:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:24] <wikibugs>	 (03PS1) 10Dzahn: Revert "Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb"" [puppet] - 10https://gerrit.wikimedia.org/r/755681
[17:08:32] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1008.eqiad.wmnet with OS buster
[17:08:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster
[17:09:03] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Make Block objects aware of which wiki they belong to" [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755678 (owner: 10Ladsgroup)
[17:13:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[17:13:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[17:14:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[17:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:03] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1008.eqiad.wmnet with OS buster
[17:15:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster executed with errors: -...
[17:15:22] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1008.eqiad.wmnet with OS buster
[17:15:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster
[17:15:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[17:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:04] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.18/includes/: Backport: [[gerrit:755678|Revert "Make Block objects aware of which wiki they belong to"]] (duration: 00m 55s)
[17:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:30] <Amir1>	 we might get a flood of errors now
[17:18:36] <Amir1>	 but it should recover
[17:21:49] <wikibugs>	 (CR) JMeybohm: [C: -1] "The diff this creates does not look right, but I've no idea why. Have not looked in detail, though" [deployment-charts] - https://gerrit.wikimedia.org/r/725003 (owner: Alexandros Kosiaris)
[17:24:34] <wikibugs>	 (PS16) Brennen Bearnes: gitlab-runner: restrict docker images and services [puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978)
[17:27:31] <wikibugs>	 (CR) Andrew Bogott: [C: +2] toolforge grid engine: install fdm [puppet] - https://gerrit.wikimedia.org/r/755489 (https://phabricator.wikimedia.org/T297683) (owner: Andrew Bogott)
[17:27:57] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH)
[17:28:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1008.eqiad.wmnet with OS buster
[17:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster executed with errors: -...
[17:28:22] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH)
[17:34:09] <icinga-wm>	 RECOVERY - LVS inference eqiad port 30443/tcp - Inference ML service IPv4 on inference.svc.eqiad.wmnet is OK: TCP OK - 0.007 second response time on inference.discovery.wmnet port 30443 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:34:41] <elukey>	 \o/
[17:34:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:35:27] <jayme>	 elukey: uuuh, new cert? :)
[17:35:56] <elukey>	 jayme: I added the discovery endpoint (finally :)
[17:36:17] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) So we think this may work, and we've ordered 2 hosts via T297151 for use and testing.
[17:36:25] <jayme>	 elukey: ah, I thought you had that for quite some time already
[17:39:30] <wikibugs>	 SRE, ops-eqiad, DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (Marostegui) Thanks Chris - I will try a reimage on Monday to see if it PXE boots fine. I have started mysql now so it can start catching up
[17:39:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:43:03] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2021.codfw.wmnet with OS buster
[17:43:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:00] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2021.codfw.wmnet
[17:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:10] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2022.codfw.wmnet with OS buster
[17:45:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:22] <wikibugs>	 (PS1) Ppchelko: Temp settings benchmarking entrypoint enhancements [mediawiki-config] - https://gerrit.wikimedia.org/r/755741
[17:45:51] <icinga-wm>	 RECOVERY - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is OK: TCP OK - 0.009 second response time on inference.discovery.wmnet port 30443 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:47:19] <wikibugs>	 (PS2) Ppchelko: Temp settings benchmarking entrypoint enhancements [mediawiki-config] - https://gerrit.wikimedia.org/r/755741
[17:47:43] <wikibugs>	 (PS4) Hashar: ci: set Docker partition size explicitly [puppet] - https://gerrit.wikimedia.org/r/755713
[17:48:43] <wikibugs>	 (PS4) KartikMistry: Deploy Flores MT [deployment-charts] - https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584)
[17:49:06] <hashar>	 I am rebalancing partitions on the CI agent https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Dpuppet%2Ddocker%2D1002/
[17:49:16] <hashar>	 patches to operations/puppet will be a bit delayed
[17:54:30] <wikibugs>	 (CR) Herron: [C: +1] "LGTM overall, please see comment inline" [puppet] - https://gerrit.wikimedia.org/r/754520 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[17:55:32] <wikibugs>	 (CR) Herron: [C: +1] P:rsyslog: add squid to the list of programs sent to logstash [puppet] - https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[17:55:35] <hashar>	 CI agent is back online
[17:57:30] <wikibugs>	 (PS2) Dzahn: Revert "Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb"" [puppet] - https://gerrit.wikimedia.org/r/755681 (https://phabricator.wikimedia.org/T281538)
[17:57:55] <wikibugs>	 (PS5) Hashar: ci: set Docker partition size explicitly [puppet] - https://gerrit.wikimedia.org/r/755713 (https://phabricator.wikimedia.org/T292729)
[17:59:30] <wikibugs>	 (CR) Herron: [C: +1] Add prometheus[12]00[56] to prometheus_nodes [puppet] - https://gerrit.wikimedia.org/r/755708 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[17:59:32] <wikibugs>	 (CR) Hashar: [C: +1] "Attached to T292729 which is the real reason for this puppet change: raise /srv disk space from 18G to now 37G." [puppet] - https://gerrit.wikimedia.org/r/755713 (https://phabricator.wikimedia.org/T292729) (owner: Hashar)
[18:00:00] <wikibugs>	 (CR) Herron: [C: +1] hieradata: add host-specific Prometheus data [puppet] - https://gerrit.wikimedia.org/r/755711 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[18:00:05] <jouncebot>	 chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1800).
[18:04:18] <wikibugs>	 (CR) Herron: [C: +1] thanos: move to a single flag to control uploads [puppet] - https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[18:04:45] <wikibugs>	 (CR) Herron: [C: +1] prepare for logstash 7.16.3 [software/logstash/plugins] - https://gerrit.wikimedia.org/r/755041 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite)
[18:05:04] <wikibugs>	 (CR) Herron: [C: +1] bump patch version to update plugins [software/logstash/plugins] - https://gerrit.wikimedia.org/r/755033 (owner: Cwhite)
[18:08:15] <wikibugs>	 (PS1) Majavah: Do not try to make watchlist collapsible on wikis where watchlist is disabled [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755682 (https://phabricator.wikimedia.org/T299671)
[18:08:41] <taavi>	 jouncebot: nowandnext
[18:08:41] <jouncebot>	 For the next 0 hour(s) and 51 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1800)
[18:08:41] <jouncebot>	 In 0 hour(s) and 51 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1900)
[18:08:59] <taavi>	 I'm boldly going to deploy that Vector backport
[18:09:16] <wikibugs>	 (CR) Majavah: [C: +2] Do not try to make watchlist collapsible on wikis where watchlist is disabled [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755682 (https://phabricator.wikimedia.org/T299671) (owner: Majavah)
[18:10:55] <wikibugs>	 (CR) Cicalese: [C: +1] Temp settings benchmarking entrypoint enhancements [mediawiki-config] - https://gerrit.wikimedia.org/r/755741 (owner: Ppchelko)
[18:13:38] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb"" [puppet] - https://gerrit.wikimedia.org/r/755681 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[18:17:11] <mutante>	 !log running puppet on cp403*
[18:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:20] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:22:42] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2022.codfw.wmnet with OS buster
[18:22:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:56] <wikibugs>	 (PS1) Clare Ming: Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555)
[18:23:05] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2022.codfw.wmnet
[18:23:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:35] <wikibugs>	 (CR) Ppchelko: [C: +2] Temp settings benchmarking entrypoint enhancements [mediawiki-config] - https://gerrit.wikimedia.org/r/755741 (owner: Ppchelko)
[18:24:16] <wikibugs>	 (Merged) jenkins-bot: Temp settings benchmarking entrypoint enhancements [mediawiki-config] - https://gerrit.wikimedia.org/r/755741 (owner: Ppchelko)
[18:25:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] Do not try to make watchlist collapsible on wikis where watchlist is disabled [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755682 (https://phabricator.wikimedia.org/T299671) (owner: Majavah)
[18:25:35] <taavi>	 ://
[18:26:24] <wikibugs>	 (Merged) jenkins-bot: Do not try to make watchlist collapsible on wikis where watchlist is disabled [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755682 (https://phabricator.wikimedia.org/T299671) (owner: Majavah)
[18:26:51] <taavi>	 Pchelolo: I am already deploying, can you wait a bit with your config patch?
[18:27:01] <Pchelolo>	 taavi: 
[18:27:06] <Pchelolo>	 oh damn, sorry
[18:27:34] <logmsgbot>	 !log ppchelko@deploy1002 Synchronized w/tmp_settings_bench.php: Config: gerrit 755741 enhancements for the settings benchmark entrypoint (duration: 00m 51s)
[18:27:35] <Pchelolo>	 it already finished
[18:27:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:52] <taavi>	 ah, continuing with my backport then
[18:27:59] <Pchelolo>	 not touching anything else anymore.
[18:28:13] <taavi>	 thanks
[18:29:54] <logmsgbot>	 !log taavi@deploy1002 Synchronized php-1.38.0-wmf.18/skins/Vector/includes/Hooks.php: Backport: [[gerrit:755682|Do not try to make watchlist collapsible on wikis where watchlist is disabled (T299671)]] (duration: 00m 50s)
[18:29:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:59] <stashbot>	 T299671: Loginwiki fatals (TypeError: Argument 1 passed to Vector\Hooks::makeMenuItemCollapsible() must be of the type array, null given, called in /srv/mediawiki/php-1.38.0-wmf.18/skins/Vector/includes/Hooks.php on line 226) - https://phabricator.wikimedia.org/T299671
[18:30:10] * taavi done
[18:31:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[18:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:30] <wikibugs>	 (PS2) Clare Ming: Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555)
[18:32:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[18:32:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[18:32:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:40] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Add WMCS specific cloud role for syslog server [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[18:32:48] <wikibugs>	 (PS9) Andrew Bogott: Add WMCS specific cloud role for syslog server [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[18:33:03] <wikibugs>	 (PS1) Dzahn: add a foot note to the index.html that this is now a Kubernetes service [container/miscweb] - https://gerrit.wikimedia.org/r/755748 (https://phabricator.wikimedia.org/T281538)
[18:33:32] <wikibugs>	 (CR) Dzahn: [C: +2] add a foot note to the index.html that this is now a Kubernetes service [container/miscweb] - https://gerrit.wikimedia.org/r/755748 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[18:33:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[18:33:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:35] <wikibugs>	 (Merged) jenkins-bot: add a foot note to the index.html that this is now a Kubernetes service [container/miscweb] - https://gerrit.wikimedia.org/r/755748 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[18:38:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[18:38:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[18:40:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[18:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[18:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:00] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:48:25] <wikibugs>	 (PS1) EJoseph: Upgrade to elasticsearh 6.8.23 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499)
[18:50:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:50:53] <wikibugs>	 (CR) Jdlrobson: [C: +1] Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) (owner: Clare Ming)
[18:51:40] <wikibugs>	 (PS1) Dzahn: miscweb: bump version to 2022-01-20-183807-production [deployment-charts] - https://gerrit.wikimedia.org/r/755751 (https://phabricator.wikimedia.org/T281538)
[18:51:52] <wikibugs>	 (CR) Nray: [C: +1] Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) (owner: Clare Ming)
[18:52:27] <wikibugs>	 (CR) Dzahn: [C: +2] miscweb: bump version to 2022-01-20-183807-production [deployment-charts] - https://gerrit.wikimedia.org/r/755751 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[18:52:58] <MatmaRex>	 dear train conductor (jeena?): i think https://phabricator.wikimedia.org/T299583 doesn't block the train, unless the log spam is too much
[18:53:06] <MatmaRex>	 i commented there
[18:54:07] <jeena>	 Thanks MatmaRex !
[18:55:59] <wikibugs>	 SRE, Foundational Technology Requests, Traffic, Wikimedia Enterprise, Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (RBrounley_WMF) In progress→Resolved
[18:56:01] <wikibugs>	 (Merged) jenkins-bot: miscweb: bump version to 2022-01-20-183807-production [deployment-charts] - https://gerrit.wikimedia.org/r/755751 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[19:00:04] <jouncebot>	 RoanKattouw and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1900).
[19:00:04] <jouncebot>	 Juan_90264 and cjming: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:17] <cjming>	 o/
[19:01:18] <urbanecm>	 hello!
[19:01:25] <urbanecm>	 cjming: hi, want to deploy today?
[19:01:58] <cjming>	 sure
[19:02:16] <urbanecm>	 go ahead then :)
[19:02:29] <cjming>	 urbanecm: would you do the 1st one if no one else has approved?
[19:02:40] <cjming>	 meaning is it ok to go for it?
[19:02:49] <urbanecm>	 cjming: Juan's not around, so it should be skipped
[19:03:05] <cjming>	 cool - onward then
[19:03:13] <wikibugs>	 (CR) Clare Ming: [C: +2] Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) (owner: Clare Ming)
[19:04:04] <wikibugs>	 (Merged) jenkins-bot: Disable language alert for pilot wikis except thwiki, viwiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/755745 (https://phabricator.wikimedia.org/T295555) (owner: Clare Ming)
[19:06:01] <cjming>	 lgtm - syncing
[19:06:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:06:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:18] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755745|Disable language alert for pilot wikis except thwiki, viwiki. (T295555)]] (duration: 00m 51s)
[19:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:21] <stashbot>	 T295555: Language switching: put an alert in the sidebar about where the language links are - https://phabricator.wikimedia.org/T295555
[19:07:59] <cjming>	 alrighty - my change is live -- shall I close this B&C window then?
[19:08:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:08:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:44] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply on main
[19:08:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:13] <wikibugs>	 (PS1) Andrew Bogott: Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] - https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546)
[19:09:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:09:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:07] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: sync on main
[19:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:36] <logmsgbot>	 !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply on main
[19:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:03] <cjming>	 !log end of UTC evening backport & config window
[19:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:04] <logmsgbot>	 !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: sync on main
[19:13:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:23] <logmsgbot>	 !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply on main
[19:14:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:04] <wikibugs>	 (PS1) Volans: spicerack: allow to execute another cookbook [software/spicerack] - https://gerrit.wikimedia.org/r/755756
[19:15:14] <wikibugs>	 (PS2) Andrew Bogott: Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] - https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546)
[19:15:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:15:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:15:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:16:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:26] <wikibugs>	 (PS3) Andrew Bogott: Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] - https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546)
[19:17:29] <logmsgbot>	 !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: sync on main
[19:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:30] <jhathaway>	 !log rebooting mx1001 to test new kernel
[19:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:11] <wikibugs>	 (PS4) Andrew Bogott: Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] - https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546)
[19:22:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] spicerack: allow to execute another cookbook [software/spicerack] - https://gerrit.wikimedia.org/r/755756 (owner: Volans)
[19:23:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Define profile::openstack::eqiad1::cinder::backup::nodes [puppet] - https://gerrit.wikimedia.org/r/755753 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[19:30:33] <bd808>	 jouncebot: now
[19:30:33] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T1900)
[19:31:30] <wikibugs>	 (PS1) Andrew Bogott: Provide cinder backup node list to rabbitmq in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/755759 (https://phabricator.wikimedia.org/T292546)
[19:33:48] <wikibugs>	 (PS2) BryanDavis: wikitech: Remove password clear on block [mediawiki-config] - https://gerrit.wikimedia.org/r/752185
[19:34:14] <wikibugs>	 SRE, Patch-For-Review, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn) Open→Resolved This is resolved! :)  Proof is the footnote in https://static-bugzilla.wikimedia.org/ that is only shown when served from k8s.  {F34924843}
[19:34:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Provide cinder backup node list to rabbitmq in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/755759 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[19:35:08] <wikibugs>	 SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (FGoodwin)
[19:35:25] <wikibugs>	 (CR) BryanDavis: [C: +2] wikitech: Remove password clear on block [mediawiki-config] - https://gerrit.wikimedia.org/r/752185 (owner: BryanDavis)
[19:36:27] <wikibugs>	 (Merged) jenkins-bot: wikitech: Remove password clear on block [mediawiki-config] - https://gerrit.wikimedia.org/r/752185 (owner: BryanDavis)
[19:38:27] <logmsgbot>	 !log bd808@deploy1002 Synchronized wmf-config/wikitech.php: wikitech: Remove password clear on block (duration: 00m 50s)
[19:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:02] <icinga-wm>	 RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 83, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:40:28] <wikibugs>	 (CR) Volans: "CI Failures are due to the latest dnspython 2.2.0 release 2 days ago." [software/spicerack] - https://gerrit.wikimedia.org/r/755756 (owner: Volans)
[19:42:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:43:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:43:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:44:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:21] <wikibugs>	 (PS2) Aaron Schulz: Simplify comments and stubs for etcd-defined DB config [mediawiki-config] - https://gerrit.wikimedia.org/r/752212
[19:49:00] <wikibugs>	 (PS1) Dzahn: delete bugzilla_static after it moved from puppet to k8s [puppet] - https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538)
[20:00:04] <jouncebot>	 jeena and twentyafterfour: May I have your attention please! MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220120T2000)
[20:02:58] <jeena>	 MatmaRex: If https://phabricator.wikimedia.org/T299583 is resolved I'd like to backport it before doing the train
[20:09:32] <MatmaRex>	 jeena: yes, please do
[20:10:02] <MatmaRex>	 jeena: i'm afk for a moment, i'll be back in 30 minutes, but i don't think you'll need me for this?
[20:10:14] <MatmaRex>	 thanks and sorry about the bug :)
[20:10:15] <jeena>	 are there 3 patches I need to backport?
[20:10:17] <jeena>	 or just the one?
[20:10:36] <MatmaRex>	 just one, i think?
[20:10:54] <MatmaRex>	 which would be the other ones?
[20:11:11] <jeena>	 ah okay, I thought there were more from the comments on your commit message
[20:11:17] <jeena>	 thanks!
[20:11:19] <MatmaRex>	 oh, the two i mentioned there are already in wmf.18
[20:11:25] <jeena>	 okay cool
[20:11:27] <MatmaRex>	 and they're the cause of this bug
[20:11:28] <MatmaRex>	 :D
[20:11:33] <jeena>	 haha
[20:11:47] <MatmaRex>	 brb
[20:13:51] <wikibugs>	 (PS1) Jeena Huneidi: Prevent assertion failure caused by empty headings [extensions/DiscussionTools] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755684 (https://phabricator.wikimedia.org/T299583)
[20:16:23] <wikibugs>	 (PS1) RLazarus: Initial deb package [software/httpbb] - https://gerrit.wikimedia.org/r/755764
[20:17:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] Initial deb package [software/httpbb] - https://gerrit.wikimedia.org/r/755764 (owner: RLazarus)
[20:19:12] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] Prevent assertion failure caused by empty headings [extensions/DiscussionTools] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755684 (https://phabricator.wikimedia.org/T299583) (owner: Jeena Huneidi)
[20:19:21] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] "backport" [extensions/DiscussionTools] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755684 (https://phabricator.wikimedia.org/T299583) (owner: Jeena Huneidi)
[20:24:01] <wikibugs>	 (Merged) jenkins-bot: Prevent assertion failure caused by empty headings [extensions/DiscussionTools] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/755684 (https://phabricator.wikimedia.org/T299583) (owner: Jeena Huneidi)
[20:25:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[20:26:40] <wikibugs>	 (CR) Umherirrender: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/755767 (https://phabricator.wikimedia.org/T282308) (owner: Umherirrender)
[20:30:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:30:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:31:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:31:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:27] <wikibugs>	 (CR) BBlack: [C: +1] "LGTM in functional terms.  Probably needs some confirmation from data eng that they're ready to have the new data appear in the webrequest" [puppet] - https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: Phuedx)
[20:31:31] <logmsgbot>	 !log jhuneidi@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/DiscussionTools/includes/HeadingItem.php: Backport: [[gerrit:755684|Prevent assertion failure caused by empty headings (T299583)]] (duration: 00m 50s)
[20:31:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:34] <stashbot>	 T299583: Wikimedia\Assert\PreconditionException: Precondition failed: Range is not collapsed - https://phabricator.wikimedia.org/T299583
[20:32:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:56] <wikibugs>	 (03PS1) 10Jeena Huneidi: all wikis to 1.38.0-wmf.18  refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755787
[20:33:58] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.38.0-wmf.18  refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755787 (owner: 10Jeena Huneidi)
[20:34:38] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1008.eqiad.wmnet with OS buster
[20:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster
[20:34:55] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.18  refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755787 (owner: 10Jeena Huneidi)
[20:35:13] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[20:36:10] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.18  refs T293959
[20:36:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:13] <stashbot>	 T293959: 1.38.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T293959
[20:37:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:45] <urandom>	 !log upgrading Cassandra to 3.11.11, aqs1010 -- T298516
[20:37:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:49] <stashbot>	 T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516
[20:38:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:38:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:38:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:11] <wikibugs>	 10SRE, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron)
[20:40:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:40:13] <wikibugs>	 10SRE, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron)
[20:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:34] <wikibugs>	 (03PS3) 10Herron: remove elk5 related LVS services [puppet] - 10https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T299700)
[20:41:25] <MatmaRex>	 thanks for backporting
[20:41:31] <wikibugs>	 (03PS1) 10Andrew Bogott: ceph: list cloudbackup2002 as a cinder backup node [puppet] - 10https://gerrit.wikimedia.org/r/755788 (https://phabricator.wikimedia.org/T292546)
[20:41:41] <jeena>	 thanks for the fix :)
[20:42:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ceph: list cloudbackup2002 as a cinder backup node [puppet] - 10https://gerrit.wikimedia.org/r/755788 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott)
[20:44:14] <wikibugs>	 (03PS2) 10Andrew Bogott: ceph: list cloudbackup2002 as a cinder backup node [puppet] - 10https://gerrit.wikimedia.org/r/755788 (https://phabricator.wikimedia.org/T292546)
[20:45:34] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] ceph: list cloudbackup2002 as a cinder backup node [puppet] - 10https://gerrit.wikimedia.org/r/755788 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott)
[20:48:04] <wikibugs>	 (03PS1) 10Herron: switch legacy elk LVS entries to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700)
[20:49:04] <wikibugs>	 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron)
[20:50:59] <wikibugs>	 (03PS1) 10Herron: remove kibana.discovery.wmnet record [dns] - 10https://gerrit.wikimedia.org/r/755790 (https://phabricator.wikimedia.org/T299700)
[20:51:24] <wikibugs>	 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron)
[20:54:40] <wikibugs>	 (03PS1) 10Eigyan: [wmf-config]: Deploy fawiki test survey to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628)
[20:58:03] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron)
[21:00:04] <wikibugs>	 10SRE, 10serviceops: Debian package for httpbb - https://phabricator.wikimedia.org/T299705 (10RLazarus) p:05Triage→03Medium
[21:00:38] <wikibugs>	 10SRE, 10serviceops: Debian package for httpbb - https://phabricator.wikimedia.org/T299705 (10RLazarus)
[21:00:43] <wikibugs>	 10SRE, 10Wikimedia-Apache-configuration, 10serviceops: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (10RLazarus)
[21:01:01] <wikibugs>	 (03PS2) 10RLazarus: Initial deb package [software/httpbb] - 10https://gerrit.wikimedia.org/r/755764 (https://phabricator.wikimedia.org/T299705)
[21:01:41] <wikibugs>	 (03PS2) 10Eigyan: [wmf-config]: Deploy fawiki test survey to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628)
[21:02:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial deb package [software/httpbb] - 10https://gerrit.wikimedia.org/r/755764 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus)
[21:04:30] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1008.eqiad.wmnet with OS buster
[21:04:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster executed with errors: -...
[21:06:23] <wikibugs>	 (03PS1) 10Addshore: Add mwcli.command_execute to wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755794 (https://phabricator.wikimedia.org/T293583)
[21:09:32] <wikibugs>	 (03PS3) 10RLazarus: Initial deb package [software/httpbb] - 10https://gerrit.wikimedia.org/r/755764 (https://phabricator.wikimedia.org/T299705)
[21:09:34] <wikibugs>	 (03PS1) 10RLazarus: tox: Run mypy only in the source directory and exclude .eggs from flake8 [software/httpbb] - 10https://gerrit.wikimedia.org/r/755796
[21:28:39] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] "verify_commit is happy, builds a deb. Installed elasticsearch-oss 6.8.23 along with this package to cirrus-integ02, loads up happy enough." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499) (owner: 10EJoseph)
[21:31:01] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Add mwcli.command_execute to wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755794 (https://phabricator.wikimedia.org/T293583) (owner: 10Addshore)
[21:45:51] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] Upgrade to elasticsearh 6.8.23 (032 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499) (owner: 10EJoseph)
[21:50:14] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:50:43] <wikibugs>	 (03PS1) 10Eevans: Pin Cassandra 3.11.11 as 'dev' [puppet] - 10https://gerrit.wikimedia.org/r/755800 (https://phabricator.wikimedia.org/T298516)
[21:53:32] <icinga-wm>	 PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:56:40] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] "PCC output: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33372/console" [puppet] - 10https://gerrit.wikimedia.org/r/755800 (https://phabricator.wikimedia.org/T298516) (owner: 10Eevans)
[21:59:47] <urandom>	 Puppet is in WARN on aqs1010, if there is anyone around that can +2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/755800, we could resolve that
[22:00:08] <urandom>	 So...is there? :)
[22:05:15] <cwhite>	 o/
[22:05:27] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/755800 (https://phabricator.wikimedia.org/T298516) (owner: 10Eevans)
[22:06:00] <cwhite>	 urandom: done
[22:06:18] <urandom>	 cwhite: awesome; thanks!
[22:15:03] <wikibugs>	 (03PS1) 10Ryan Kemper: wcqs: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/755806 (https://phabricator.wikimedia.org/T282117)
[22:15:29] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: ensure dlq directory exists [puppet] - 10https://gerrit.wikimedia.org/r/753571 (owner: 10Cwhite)
[22:17:11] <wikibugs>	 (03CR) 10Bking: [V: 03+1] wcqs: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/755806 (https://phabricator.wikimedia.org/T282117) (owner: 10Ryan Kemper)
[22:17:22] <wikibugs>	 (03CR) 10Cwhite: [V: 03+2 C: 03+2] bump patch version to update plugins [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/755033 (owner: 10Cwhite)
[22:22:36] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] builder: add opensearch1 pbuilder hooks for logstash-plugins update [puppet] - 10https://gerrit.wikimedia.org/r/755043 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite)
[22:26:30] <wikibugs>	 (03CR) 10Ryan Kemper: "We will merge this when we're ready to go from monitoring_setup to production. Currently we're in lvs_setup going into monitoring_setup so" [dns] - 10https://gerrit.wikimedia.org/r/755806 (https://phabricator.wikimedia.org/T282117) (owner: 10Ryan Kemper)
[22:27:08] <urandom>	 !log rolling restart of Cassandra, aqs-next -- T298516
[22:27:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:14] <stashbot>	 T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516
[22:33:13] <wikibugs>	 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10tstarling) >>! In T292322#7636542, @Joe wrote: > But given in reality I was proposing to do something like: >  > signature = md5sum( secret + padding + re...
[22:35:22] <wikibugs>	 (03PS1) 10Bking: wcqs: Move back from lvs_setup to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/755810 (https://phabricator.wikimedia.org/T280001)
[22:36:19] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] wcqs: Move back from lvs_setup to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/755810 (https://phabricator.wikimedia.org/T280001) (owner: 10Bking)
[22:36:41] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wcqs: Move back from lvs_setup to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/755810 (https://phabricator.wikimedia.org/T280001) (owner: 10Bking)
[22:38:36] <inflatador>	 !log running puppet-merge for ^^
[22:38:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:20] <inflatador>	 !log running puppet-merge for https://gerrit.wikimedia.org/r/755810
[22:40:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:01] <wikibugs>	 (03PS1) 10Cwhite: logstash: install logstash-plugins on logging logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/755811 (https://phabricator.wikimedia.org/T299168)
[22:51:32] <wikibugs>	 (03PS1) 10Cwhite: logstash: switch to opensearch output plugin on production logstash [puppet] - 10https://gerrit.wikimedia.org/r/755812 (https://phabricator.wikimedia.org/T299168)
[22:53:27] <icinga-wm>	 RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:57:35] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] Undeploy UserMerge (1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755532 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah)
[23:05:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr)
[23:51:29] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:58:11] <icinga-wm>	 PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 875942 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops