[00:00:05] <jouncebot>	 brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220121T0000).
[00:00:29] <brennen>	 o/
[00:00:47] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:48] <brennen>	 no one for training today and no patches in the queue; calling it.
[00:02:59] <thcipriani>	 +1
[00:04:34] <wikibugs>	 (CR) Scardenasmolinar: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628) (owner: Eigyan)
[00:26:41] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_network_flows_internal-sanitization_daily.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:16] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[00:33:38] <wikibugs>	 (CR) Cwhite: [C: +1] switch legacy elk LVS entries to state: lvs_setup [puppet] - https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[00:33:48] <wikibugs>	 (CR) Cwhite: [C: +1] remove kibana.discovery.wmnet record [dns] - https://gerrit.wikimedia.org/r/755790 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[00:34:59] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[00:35:27] <wikibugs>	 (CR) Cwhite: [C: +1] hieradata: add host-specific Prometheus data [puppet] - https://gerrit.wikimedia.org/r/755711 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[00:36:29] <wikibugs>	 (CR) Cwhite: [C: +1] Add prometheus[12]00[56] to prometheus_nodes [puppet] - https://gerrit.wikimedia.org/r/755708 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[01:35:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:40:07] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:44:41] <icinga-wm>	 RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[02:13:19] <wikibugs>	 (PS1) Scardenasmolinar: Lower The Wikipedia Library editcount [mediawiki-config] - https://gerrit.wikimedia.org/r/755834 (https://phabricator.wikimedia.org/T288070)
[03:33:17] <icinga-wm>	 PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:17] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 58.04 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:15:37] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 87.1 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:38:20] <wikibugs>	 (PS1) Marostegui: Revert "es1022: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755769
[05:41:56] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es1022: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755769 (owner: Marostegui)
[05:42:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18950 and previous config saved to /var/cache/conftool/dbconfig/20220121-054228-root.json
[05:42:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:47] <wikibugs>	 (PS1) Marostegui: es2030,es2032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755850 (https://phabricator.wikimedia.org/T299741)
[05:48:40] <wikibugs>	 (CR) Marostegui: [C: +2] es2030,es2032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755850 (https://phabricator.wikimedia.org/T299741) (owner: Marostegui)
[05:49:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2030.codfw.wmnet with OS bullseye
[05:49:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18951 and previous config saved to /var/cache/conftool/dbconfig/20220121-055732-root.json
[05:57:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18952 and previous config saved to /var/cache/conftool/dbconfig/20220121-061235-root.json
[06:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:46] <wikibugs>	 (PS1) Marostegui: Revert "es2030,es2032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755770
[06:19:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2030.codfw.wmnet with OS bullseye
[06:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:45] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2030,es2032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755770 (owner: Marostegui)
[06:21:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2032 to es1 master T299741', diff saved to https://phabricator.wikimedia.org/P18953 and previous config saved to /var/cache/conftool/dbconfig/20220121-062116-marostegui.json
[06:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:20] <stashbot>	 T299741: Upgrade es1 to Bullseye - https://phabricator.wikimedia.org/T299741
[06:22:54] <wikibugs>	 (PS1) Marostegui: es2028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755853 (https://phabricator.wikimedia.org/T299741)
[06:23:46] <wikibugs>	 (CR) Marostegui: [C: +2] es2028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755853 (https://phabricator.wikimedia.org/T299741) (owner: Marostegui)
[06:24:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bullseye
[06:24:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 20%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18954 and previous config saved to /var/cache/conftool/dbconfig/20220121-062739-root.json
[06:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18955 and previous config saved to /var/cache/conftool/dbconfig/20220121-064243-root.json
[06:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS bullseye
[06:54:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:36] <wikibugs>	 (PS1) Marostegui: Revert "es2028: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755771
[06:56:26] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2028: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755771 (owner: Marostegui)
[06:57:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 40%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18956 and previous config saved to /var/cache/conftool/dbconfig/20220121-065746-root.json
[06:57:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1032 for reimage T299741', diff saved to https://phabricator.wikimedia.org/P18957 and previous config saved to /var/cache/conftool/dbconfig/20220121-065854-marostegui.json
[06:58:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:58] <stashbot>	 T299741: Upgrade es1 to Bullseye - https://phabricator.wikimedia.org/T299741
[07:00:02] <wikibugs>	 (PS1) Marostegui: es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755856 (https://phabricator.wikimedia.org/T299741)
[07:01:00] <wikibugs>	 (CR) Marostegui: [C: +2] es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755856 (https://phabricator.wikimedia.org/T299741) (owner: Marostegui)
[07:04:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1032.eqiad.wmnet with OS bullseye
[07:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18958 and previous config saved to /var/cache/conftool/dbconfig/20220121-071250-root.json
[07:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:57] <elukey>	 !log systemctl reset-failed session-3.scope on an-test-client1001 (failed, transient unit)
[07:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:14] <elukey>	 !log elukey@build2001:~$ sudo systemctl reset-failed ifup@ens13.service
[07:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:28] <elukey>	 !log elukey@stat1007:~$ sudo systemctl reset-failed product-analytics-movement-metrics.service
[07:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 60%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18959 and previous config saved to /var/cache/conftool/dbconfig/20220121-072754-root.json
[07:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:12] <wikibugs>	 (PS1) Marostegui: Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755772
[07:30:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1032.eqiad.wmnet with OS bullseye
[07:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18960 and previous config saved to /var/cache/conftool/dbconfig/20220121-073051-root.json
[07:30:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:58] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755772 (owner: Marostegui)
[07:32:33] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:01] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:35:04] <wikibugs>	 (PS1) Marostegui: core_multiinstance.my.cnf.erb: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/755915 (https://phabricator.wikimedia.org/T287244)
[07:36:37] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:36:58] <wikibugs>	 (CR) Marostegui: [C: +2] core_multiinstance.my.cnf.erb: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/755915 (https://phabricator.wikimedia.org/T287244) (owner: Marostegui)
[07:42:09] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18961 and previous config saved to /var/cache/conftool/dbconfig/20220121-074257-root.json
[07:43:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18962 and previous config saved to /var/cache/conftool/dbconfig/20220121-074555-root.json
[07:45:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:21] <wikibugs>	 (PS1) Marostegui: Revert "db2087: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755773
[07:52:27] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2087: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755773 (owner: Marostegui)
[07:58:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18963 and previous config saved to /var/cache/conftool/dbconfig/20220121-075801-root.json
[07:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220121T0800)
[08:00:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18964 and previous config saved to /var/cache/conftool/dbconfig/20220121-080058-root.json
[08:01:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:16:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18965 and previous config saved to /var/cache/conftool/dbconfig/20220121-081602-root.json
[08:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:18] <icinga-wm>	 PROBLEM - carbon-cache@h service on cloudmetrics1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:17:24] <icinga-wm>	 PROBLEM - carbon-cache@c service on cloudmetrics1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:17:38] <icinga-wm>	 PROBLEM - carbon-local-relay service on cloudmetrics1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:19:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:19:24] <icinga-wm>	 RECOVERY - carbon-cache@h service on cloudmetrics1004 is OK: OK - carbon-cache@h is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:19:30] <icinga-wm>	 RECOVERY - carbon-cache@c service on cloudmetrics1004 is OK: OK - carbon-cache@c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:19:44] <icinga-wm>	 RECOVERY - carbon-local-relay service on cloudmetrics1004 is OK: OK - carbon-local-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:27:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet
[08:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18966 and previous config saved to /var/cache/conftool/dbconfig/20220121-083106-root.json
[08:31:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet
[08:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1018.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[08:35:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1018.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[08:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:42:13] <wikibugs>	 (PS1) Vgutierrez: site: Reimage cp3063 as cache::upload_envoy [puppet] - https://gerrit.wikimedia.org/r/755917 (https://phabricator.wikimedia.org/T271421)
[08:42:17] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: bump open files limit for blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/755918 (https://phabricator.wikimedia.org/T296199)
[08:44:30] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: bump open files limit for blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/755918 (https://phabricator.wikimedia.org/T296199)
[08:46:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18967 and previous config saved to /var/cache/conftool/dbconfig/20220121-084609-root.json
[08:46:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:21] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: bump open files limit for blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/755918 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[08:48:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: switch to opensearch output plugin on production logstash [puppet] - https://gerrit.wikimedia.org/r/755812 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite)
[08:53:20] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: install logstash-plugins on logging logstash clusters [puppet] - https://gerrit.wikimedia.org/r/755811 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite)
[08:53:40] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] switch legacy elk LVS entries to state: lvs_setup [puppet] - https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[08:53:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] remove kibana.discovery.wmnet record [dns] - https://gerrit.wikimedia.org/r/755790 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[08:54:00] <wikibugs>	 (PS2) ArielGlenn: update wme html dumps downloader to use JWT auth tokens [puppet] - https://gerrit.wikimedia.org/r/755345 (https://phabricator.wikimedia.org/T273585)
[08:54:40] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:55:43] <wikibugs>	 (PS1) JMeybohm: Remove the hacks to around masquerade-all [deployment-charts] - https://gerrit.wikimedia.org/r/755920 (https://phabricator.wikimedia.org/T290967)
[08:56:04] <wikibugs>	 (PS2) JMeybohm: Remove the hacks around masquerade-all [deployment-charts] - https://gerrit.wikimedia.org/r/755920 (https://phabricator.wikimedia.org/T290967)
[08:56:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:57:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:58:50] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:00:35] <vgutierrez>	 !log depool cp3063 to be reimaged as cache::upload_envoy - T271421
[09:00:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:39] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[09:01:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18968 and previous config saved to /var/cache/conftool/dbconfig/20220121-090113-root.json
[09:01:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:17] <wikibugs>	 (CR) Vgutierrez: [C: +2] site: Reimage cp3063 as cache::upload_envoy [puppet] - https://gerrit.wikimedia.org/r/755917 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:03:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:04:18] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3063.esams.wmnet with OS buster
[09:04:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:30] <wikibugs>	 SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster
[09:04:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:05:17] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: override valid status codes for http probes [puppet] - https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T296199)
[09:06:28] <wikibugs>	 (CR) JMeybohm: [C: +2] "Just FYI - this will produce a diff on ml as well as the staging-codfw node IPs were hardcoded" [deployment-charts] - https://gerrit.wikimedia.org/r/755920 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:06:33] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:06:35] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[09:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:50] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:10:30] <wikibugs>	 (Merged) jenkins-bot: Remove the hacks around masquerade-all [deployment-charts] - https://gerrit.wikimedia.org/r/755920 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:11:04] <icinga-wm>	 RECOVERY - Disk space on deneb is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[09:11:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:11:36] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:11:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:38] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (aborrero)
[09:13:15] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (aborrero)
[09:13:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (aborrero)
[09:13:40] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (aborrero)
[09:13:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (aborrero)
[09:16:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18969 and previous config saved to /var/cache/conftool/dbconfig/20220121-091617-root.json
[09:16:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:34] <wikibugs>	 (PS1) JMeybohm: Add master IPs to main/wikikube clusters [deployment-charts] - https://gerrit.wikimedia.org/r/755924 (https://phabricator.wikimedia.org/T290967)
[09:19:18] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:37] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:28] <wikibugs>	 (PS3) ArielGlenn: update wme html dumps downloader to use JWT auth tokens [puppet] - https://gerrit.wikimedia.org/r/755345 (https://phabricator.wikimedia.org/T273585)
[09:26:12] <wikibugs>	 (CR) ArielGlenn: [C: +2] update wme html dumps downloader to use JWT auth tokens [puppet] - https://gerrit.wikimedia.org/r/755345 (https://phabricator.wikimedia.org/T273585) (owner: ArielGlenn)
[09:30:02] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, double-checked control plane IPs." [deployment-charts] - https://gerrit.wikimedia.org/r/755924 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:31:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18970 and previous config saved to /var/cache/conftool/dbconfig/20220121-093120-root.json
[09:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:25] <wikibugs>	 (CR) JMeybohm: [C: +2] Add master IPs to main/wikikube clusters [deployment-charts] - https://gerrit.wikimedia.org/r/755924 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:32:29] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: override valid status codes for http probes [puppet] - https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T291946)
[09:32:31] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: support probes behind SSO [puppet] - https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946)
[09:32:42] <wikibugs>	 (Abandoned) Elukey: role::pki::root: add the ml_serve intermediate PKI [puppet] - https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: Elukey)
[09:33:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] prometheus: override valid status codes for http probes [puppet] - https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:33:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] prometheus: support probes behind SSO [puppet] - https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:35:23] <wikibugs>	 (Merged) jenkins-bot: Add master IPs to main/wikikube clusters [deployment-charts] - https://gerrit.wikimedia.org/r/755924 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:37:31] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33373/console" [puppet] - https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:39:49] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: override valid status codes for http probes [puppet] - https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T291946)
[09:39:51] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: support probes behind SSO [puppet] - https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946)
[09:40:41] <logmsgbot>	 !log vgutierrez@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3063.esams.wmnet with OS buster
[09:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:49] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster executed with errors: - cp30...
[09:41:35] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3063.esams.wmnet with OS buster
[09:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:44] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster
[09:45:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:46:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: override valid status codes for http probes [puppet] - 10https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[09:47:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: support probes behind SSO [puppet] - 10https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[09:49:45] <wikibugs>	 (03PS1) 10Marostegui: misc.my.cnf.erb: innodb_checksum_algorithm=full_crc32 [puppet] - 10https://gerrit.wikimedia.org/r/755932 (https://phabricator.wikimedia.org/T287244)
[09:50:35] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] misc.my.cnf.erb: innodb_checksum_algorithm=full_crc32 [puppet] - 10https://gerrit.wikimedia.org/r/755932 (https://phabricator.wikimedia.org/T287244) (owner: 10Marostegui)
[09:50:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1004.eqiad.wmnet with reason: Switch back to plain disks
[09:50:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1004.eqiad.wmnet with reason: Switch back to plain disks
[09:51:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:38] <moritzm>	 !log switch kubetcd1004 back to plain disks
[09:51:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (10Jelto) p:05Triage→03Medium
[09:52:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:55:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (10Jelto) thanks for the request. To proceed with this request approval is needed from @odimitrijevic and your manager @WDoranWMF .
[09:55:58] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:58:04] <wikibugs>	 (03PS1) 10Ayounsi: Add scs-f8-eqiad to Icinga and Rancid [puppet] - 10https://gerrit.wikimedia.org/r/755933 (https://phabricator.wikimedia.org/T298980)
[09:58:47] <wikibugs>	 (03PS1) 10Marostegui: Revert "x2 hosts: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/755780
[09:59:27] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "x2 hosts: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/755780 (owner: 10Marostegui)
[10:07:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: Switch back to plain disks
[10:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: Switch back to plain disks
[10:07:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:53] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2023.codfw.wmnet with OS buster
[10:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:15] <moritzm>	 !log switch kubetcd1005 back to plain disks
[10:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:27] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add scs-f8-eqiad to Icinga and Rancid [puppet] - 10https://gerrit.wikimedia.org/r/755933 (https://phabricator.wikimedia.org/T298980) (owner: 10Ayounsi)
[10:14:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1006.eqiad.wmnet with reason: Switch back to plain disks
[10:14:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1006.eqiad.wmnet with reason: Switch back to plain disks
[10:14:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:47] <moritzm>	 !log switch kubetcd1006 back to plain disks
[10:14:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:18] <wikibugs>	 (03CR) 10Hashar: gerrit: use default for index.batchThreads (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755329 (owner: 10Hashar)
[10:22:28] <wikibugs>	 (03PS2) 10Hashar: gerrit: use default for index.batchThreads [puppet] - 10https://gerrit.wikimedia.org/r/755329
[10:23:33] <wikibugs>	 (03CR) 10Hashar: "Now that we run Gerrit 3.x and no more have a database backend, I am making the index batch threads to use the default value (based on num" [puppet] - 10https://gerrit.wikimedia.org/r/407857 (owner: 10Dzahn)
[10:26:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff)
[10:27:43] <wikibugs>	 (03PS2) 10Muehlenhoff: Make ganeti1025 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755723
[10:33:44] <moritzm>	 !log migrate primary/secondary instances off ganeti1013
[10:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:09] <wikibugs>	 (03PS1) 10Kormat: Prepare for 0.8.1 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/755947 (https://phabricator.wikimedia.org/T297605)
[10:47:28] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] Prepare for 0.8.1 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/755947 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat)
[10:47:38] <wikibugs>	 (03PS1) 10Hashar: ci: always use a LVM volume for Docker data [puppet] - 10https://gerrit.wikimedia.org/r/755948
[10:50:07] <wikibugs>	 (03Merged) 10jenkins-bot: Prepare for 0.8.1 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/755947 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat)
[10:53:16] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I have cherry picked it on the integration puppet master and it is working as expected: it is a noop." [puppet] - 10https://gerrit.wikimedia.org/r/755948 (owner: 10Hashar)
[10:55:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1025 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755723 (owner: 10Muehlenhoff)
[10:58:46] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3063.esams.wmnet with OS buster
[10:58:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:55] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster completed: - cp3063 (**WARN*...
[11:14:35] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2023.codfw.wmnet with OS buster
[11:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:00] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: Rename main cluster to wikikube (1/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris)
[11:15:29] <vgutierrez>	 !log pool cp3063 running envoy as TLS termination layer - T271421
[11:15:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:32] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[11:16:02] <wikibugs>	 (03PS1) 10Elukey: kserve: move to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/755955 (https://phabricator.wikimedia.org/T298976)
[11:17:48] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2023.codfw.wmnet
[11:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:13] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2024.codfw.wmnet with OS buster
[11:18:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:56] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1016.eqiad.wmnet with OS buster
[11:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:01] <wikibugs>	 (03PS1) 10Elukey: envoy-future: add the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550)
[11:24:31] <wikibugs>	 (03PS1) 10Hnowlan: postgres: add option to enable replication slots [puppet] - 10https://gerrit.wikimedia.org/r/755959 (https://phabricator.wikimedia.org/T290149)
[11:25:43] <wikibugs>	 (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550) (owner: 10Elukey)
[11:26:40] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] kserve: move to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/755955 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[11:30:26] <wikibugs>	 (03PS2) 10Hnowlan: postgres: add option to enable replication slots [puppet] - 10https://gerrit.wikimedia.org/r/755959 (https://phabricator.wikimedia.org/T290149)
[11:31:33] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:35] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33376/console" [puppet] - 10https://gerrit.wikimedia.org/r/755959 (https://phabricator.wikimedia.org/T290149) (owner: 10Hnowlan)
[11:34:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:34:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:34:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[11:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[11:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:57] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Manually switchover primary stats db db1159 -> db1128 [puppet] - 10https://gerrit.wikimedia.org/r/755960 (https://phabricator.wikimedia.org/T299624)
[11:43:29] <wikibugs>	 (03CR) 10Jcrespo: "See context at: https://phabricator.wikimedia.org/T299624#7639955" [puppet] - 10https://gerrit.wikimedia.org/r/755960 (https://phabricator.wikimedia.org/T299624) (owner: 10Jcrespo)
[11:49:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] dbbackups: Manually switchover primary stats db db1159 -> db1128 [puppet] - 10https://gerrit.wikimedia.org/r/755960 (https://phabricator.wikimedia.org/T299624) (owner: 10Jcrespo)
[11:56:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet
[11:56:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:57] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:01:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet
[12:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1025.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[12:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:13] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1016.eqiad.wmnet with OS buster
[12:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:16] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2024.codfw.wmnet with OS buster
[12:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:26] <wikibugs>	 (03PS1) 10Hashar: gerrit: set sshd.enableChannelIdTracking=false [puppet] - 10https://gerrit.wikimedia.org/r/755968 (https://phabricator.wikimedia.org/T263293)
[12:25:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney)
[12:25:50] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2025.codfw.wmnet with OS buster
[12:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:54] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1017.eqiad.wmnet with OS buster
[12:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:21] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2024.codfw.wmnet
[12:26:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:33] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1016.eqiad.wmnet
[12:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:52] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti1025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[12:29:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney)
[12:31:16] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti1025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[12:32:52] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:34:19] <wikibugs>	 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) >>! In T299527#7637094, @Cmjohnson wrote: > @MoritzMuehlenhoff The idrac is giving me a hard time, it's not worth slowing this pr...
[12:35:40] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:39:11] <wikibugs>	 (03PS1) 10Elukey: knative-serving: move egress gateway to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/755970 (https://phabricator.wikimedia.org/T298976)
[12:48:05] <wikibugs>	 (03PS1) 10Btullis: Use the default prometheus_mysql_exporter for matomo [puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762)
[12:49:06] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33377/console" [puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis)
[12:50:10] <wikibugs>	 10ops-eqiad, 10DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (10Aklapper) Adding #ops-eqiad (feel free to correct) so this ticket can be found.
[12:51:07] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] envoy-future: add the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550) (owner: 10Elukey)
[12:52:20] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "I will need to remove traces of the previous prometheus-mysqld-exporter@matomo.service manually, once this has been deployed." [puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis)
[12:53:01] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:56:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:59:38] <wikibugs>	 10ops-eqiad, 10DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (10cmooney)
[13:00:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:00:47] <wikibugs>	 10ops-eqiad, 10DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (10cmooney)
[13:00:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney)
[13:01:14] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided)
[13:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:23] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s)
[13:01:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney) Currently waiting on T299759 to be completed to gain console access to these devices and begin the process.
[13:03:54] <wikibugs>	 10ops-eqiad, 10DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (10ayounsi)
[13:05:03] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2025.codfw.wmnet
[13:05:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:10] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1017.eqiad.wmnet
[13:07:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:22] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (10jcrespo)
[13:09:21] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (10jcrespo) p:05Triage→03High
[13:09:31] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1017.eqiad.wmnet with OS buster
[13:09:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:25] <icinga-wm>	 PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:13:03] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2025.codfw.wmnet with OS buster
[13:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:17] <wikibugs>	 (03PS4) 10ArielGlenn: [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585)
[13:15:41] <wikibugs>	 (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755834 (https://phabricator.wikimedia.org/T288070) (owner: 10Scardenasmolinar)
[13:15:49] <icinga-wm>	 RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:15:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn)
[13:17:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti1026 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755975
[13:19:02] <wikibugs>	 (03PS5) 10ArielGlenn: [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585)
[13:24:11] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis)
[13:34:10] <wikibugs>	 (03PS6) 10ArielGlenn: [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585)
[13:42:30] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "LGTM as well - Only question I have is should we add other (high entropy) CH-UA header values, or not now." [puppet] - 10https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: 10Phuedx)
[13:44:39] <wikibugs>	 (03PS1) 10JMeybohm: Upgrade staging-eqiad kubernetes master to a full node [puppet] - 10https://gerrit.wikimedia.org/r/755977 (https://phabricator.wikimedia.org/T290967)
[13:47:44] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn)
[13:48:22] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 9 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33380/console" [puppet] - 10https://gerrit.wikimedia.org/r/755977 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[13:49:54] <wikibugs>	 (03PS1) 10JMeybohm: Add kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967)
[13:50:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[13:51:36] <wikibugs>	 (03CR) 10ArielGlenn: "Gah forgot to remove the WIP from the commit message after staring at the diff and the pcc output for too long. :-(" [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn)
[13:52:22] <wikibugs>	 (03PS2) 10JMeybohm: Add kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967)
[13:58:24] <wikibugs>	 (03PS11) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966)
[14:00:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Deploy Flores MT [deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584) (owner: 10KartikMistry)
[14:02:34] <wikibugs>	 (03PS12) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966)
[14:07:52] <wikibugs>	 (03PS1) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585)
[14:16:31] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Use the default prometheus_mysql_exporter for matomo [puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis)
[14:21:33] <wikibugs>	 (03PS5) 10KartikMistry: Deploy Flores MT [deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584)
[14:35:07] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided)
[14:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:12] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1018.eqiad.wmnet with OS buster
[14:35:14] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 07s)
[14:35:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:23] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS buster
[14:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:14] <wikibugs>	 (03PS1) 10Filippo Giunchedi: puppetdb-api: allow prometheus_nodes via ferm [puppet] - 10https://gerrit.wikimedia.org/r/755982 (https://phabricator.wikimedia.org/T291946)
[14:37:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33382/console" [puppet] - 10https://gerrit.wikimedia.org/r/755982 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[14:37:34] <wikibugs>	 (03PS1) 10Joal: Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983
[14:38:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: move egress gateway to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/755970 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey)
[14:38:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983 (owner: 10Joal)
[14:39:44] <wikibugs>	 (03PS2) 10Joal: Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983
[14:40:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:40:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:21] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[14:41:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983 (owner: 10Joal)
[14:45:44] <wikibugs>	 (03PS1) 10Elukey: Move ml-services to the new CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/755984 (https://phabricator.wikimedia.org/T298976)
[14:48:37] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez)
[14:48:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[14:49:35] <wikibugs>	 (CR) Elukey: [C: +2] Move ml-services to the new CA cert bundle [deployment-charts] - https://gerrit.wikimedia.org/r/755984 (https://phabricator.wikimedia.org/T298976) (owner: Elukey)
[14:50:52] <wikibugs>	 (PS1) Muehlenhoff: sre.ganeti.addnode: Make target for validate_state configurable [cookbooks] - https://gerrit.wikimedia.org/r/756006
[14:52:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[14:52:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:33] <wikibugs>	 (PS3) Joal: Update network_flows_internal druid indexation job [puppet] - https://gerrit.wikimedia.org/r/755983 (https://phabricator.wikimedia.org/T263277)
[14:53:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[14:57:50] <wikibugs>	 (CR) Ottomata: [C: +2] Update network_flows_internal druid indexation job [puppet] - https://gerrit.wikimedia.org/r/755983 (https://phabricator.wikimedia.org/T263277) (owner: Joal)
[15:07:41] <herron>	 !log removing kibana.discovery.wmnet record and switching legacy elk LVS instances to state: lvs_setup T299700
[15:07:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:46] <stashbot>	 T299700: Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700
[15:07:54] <wikibugs>	 (CR) Hnowlan: [C: +1] envoy-future: add the wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550) (owner: Elukey)
[15:08:11] <wikibugs>	 (PS2) Herron: switch legacy elk LVS entries to state: lvs_setup [puppet] - https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700)
[15:09:20] <wikibugs>	 (CR) Herron: [C: +2] remove kibana.discovery.wmnet record [dns] - https://gerrit.wikimedia.org/r/755790 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[15:10:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] ci: set Docker partition size explicitly [puppet] - https://gerrit.wikimedia.org/r/755713 (https://phabricator.wikimedia.org/T292729) (owner: Hashar)
[15:10:20] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] ci: always use a LVM volume for Docker data [puppet] - https://gerrit.wikimedia.org/r/755948 (owner: Hashar)
[15:10:30] <wikibugs>	 (PS2) Alexandros Kosiaris: ci: always use a LVM volume for Docker data [puppet] - https://gerrit.wikimedia.org/r/755948 (owner: Hashar)
[15:10:40] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +2 C: +2] ci: always use a LVM volume for Docker data [puppet] - https://gerrit.wikimedia.org/r/755948 (owner: Hashar)
[15:10:43] <wikibugs>	 (CR) Herron: [C: +2] switch legacy elk LVS entries to state: lvs_setup [puppet] - https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[15:15:42] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] envoy-future: add the wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550) (owner: Elukey)
[15:17:14] <wikibugs>	 (CR) Klausman: [C: +1] kserve: move to cert-manager [deployment-charts] - https://gerrit.wikimedia.org/r/755955 (https://phabricator.wikimedia.org/T298976) (owner: Elukey)
[15:22:29] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1018.eqiad.wmnet with OS buster
[15:22:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:01] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1018.eqiad.wmnet
[15:24:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:23] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1019.eqiad.wmnet with OS buster
[15:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:01] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2026.codfw.wmnet with OS buster
[15:25:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:13] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on mx1001.wikimedia.org with reason: kernel testing
[15:29:15] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on mx1001.wikimedia.org with reason: kernel testing
[15:29:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:09] <icinga-wm>	 RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:55] <icinga-wm>	 RECOVERY - ganeti-mond running on ganeti1025 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[15:35:57] <icinga-wm>	 RECOVERY - ganeti-confd running on ganeti1025 is OK: PROCS OK: 1 process with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[15:37:26] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Overall LGTM - a couple questions along the way and one usability suggestion." [deployment-charts] - https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[15:42:58] <wikibugs>	 SRE, Traffic, Patch-For-Review, SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (herron)
[15:46:22] <wikibugs>	 (PS1) Matthias Mullie: Stop capturing media change tags [mediawiki-config] - https://gerrit.wikimedia.org/r/756014 (https://phabricator.wikimedia.org/T286362)
[15:50:15] <moritzm>	 !log added ganeti1025 to Ganeti eqiad cluster T293909
[15:50:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:19] <stashbot>	 T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909
[15:50:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1025.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[15:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1018.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[15:51:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1018.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[15:51:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1013.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[15:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1013.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[15:51:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:43] <wikibugs>	 (PS8) Giuseppe Lavagetto: Introduce the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/749717
[15:51:45] <wikibugs>	 (PS5) Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/749718
[15:51:47] <wikibugs>	 (PS1) Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/756016
[15:52:58] <wikibugs>	 (CR) Majavah: Simplify management of the request time limit (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/749718 (owner: Giuseppe Lavagetto)
[15:54:26] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff)
[15:54:40] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1013
[16:02:07] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2000 days, 0:00:00 on sodium.wikimedia.org with reason: decom
[16:02:09] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2000 days, 0:00:00 on sodium.wikimedia.org with reason: decom
[16:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:54] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1019.eqiad.wmnet with OS buster
[16:03:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:33] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1019.eqiad.wmnet
[16:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:02] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.decommission for hosts sodium.wikimedia.org
[16:05:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:51] <wikibugs>	 (CR) Ladsgroup: [C: +1] Introduce the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/749717 (owner: Giuseppe Lavagetto)
[16:09:34] <wikibugs>	 (PS1) Aqu: Airflow: Fix links in error emails [puppet] - https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074)
[16:11:29] <wikibugs>	 (PS2) Aqu: Airflow: Fix links in error emails [puppet] - https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074)
[16:15:30] <wikibugs>	 (PS3) Aqu: Airflow: Fix links in error emails [puppet] - https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074)
[16:16:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] Airflow: Fix links in error emails [puppet] - https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074) (owner: Aqu)
[16:18:41] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided)
[16:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:49] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s)
[16:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:18] <wikibugs>	 SRE, ops-codfw: Possible cable issue on restbase2010 management interface - https://phabricator.wikimedia.org/T299426 (hnowlan) Given the flapping IPMI checks outside of the reimage issues, I suspect this might be more than a firmware upgrade, but given how some other restbase hosts have performed I'm op...
[16:20:29] <wikibugs>	 SRE, ops-codfw, ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (hnowlan)
[16:20:29] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1020.eqiad.wmnet with OS buster
[16:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:50] <wikibugs>	 (PS2) ArielGlenn: add systemd timer for Enterprise HTML dumps download [puppet] - https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585)
[16:26:22] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sodium.wikimedia.org
[16:26:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:29] <wikibugs>	 SRE, Infrastructure-Foundations: decom sodium - https://phabricator.wikimedia.org/T298727 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jhathaway@cumin1001 for hosts: `sodium.wikimedia.org` - sodium.wikimedia.org (**PASS**)   - Downtimed host on Icinga   - Found physical host   - Down...
[16:46:37] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1020.eqiad.wmnet with OS buster
[16:46:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:20] <wikibugs>	 SRE, ops-codfw, ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (hnowlan)
[16:47:34] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1020.eqiad.wmnet
[16:47:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:38] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1021.eqiad.wmnet with OS buster
[16:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:49] <wikibugs>	 SRE, ops-codfw, ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (hnowlan)
[16:55:53] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1021.eqiad.wmnet with OS buster
[16:55:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:02] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1021.eqiad.wmnet
[16:56:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:49] <wikibugs>	 (CR) JMeybohm: Add basic ingress support to chart common_templates (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[17:19:21] <wikibugs>	 (CR) RLazarus: [C: +2] Return a set, not a list, from active_images() [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748873 (owner: RLazarus)
[17:20:09] <wikibugs>	 SRE, Traffic, Patch-For-Review, SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (herron) Services have been moved to lvs_setup, but there are some pybal icinga alerts still open e.g.   ` lvs1015 PyBal IPVS diff check CRITICAL 2022-01-2...
[17:21:44] <wikibugs>	 (Merged) jenkins-bot: Return a set, not a list, from active_images() [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748873 (owner: RLazarus)
[17:22:30] <wikibugs>	 (PS1) RLazarus: Release v0.0.4 [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/756026
[17:24:52] <wikibugs>	 (PS1) JHathaway: sodium.wikimedia.org: remove reference, decommed [puppet] - https://gerrit.wikimedia.org/r/756027 (https://phabricator.wikimedia.org/T298727)
[17:31:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (Andrew) This resembles the more-frequent issues that we've seen on 1003 (T297814) -- it's not exactly a crash, the system just gets so slow that things...
[17:34:35] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:openstack::rabbitmq: add tls ports to firewall [puppet] - https://gerrit.wikimedia.org/r/755492 (https://phabricator.wikimedia.org/T297268) (owner: Majavah)
[17:35:08] <wikibugs>	 (CR) RLazarus: [C: +2] Release v0.0.4 [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/756026 (owner: RLazarus)
[17:35:15] <wikibugs>	 (CR) JHathaway: [C: +2] sodium.wikimedia.org: remove reference, decommed [puppet] - https://gerrit.wikimedia.org/r/756027 (https://phabricator.wikimedia.org/T298727) (owner: JHathaway)
[17:37:09] <wikibugs>	 (PS3) AOkoth: kuberenetes: disable mwautopull timer [puppet] - https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345)
[17:37:11] <wikibugs>	 (PS1) AOkoth: gitlab: change restore manifest formatting [puppet] - https://gerrit.wikimedia.org/r/756033
[17:37:26] <wikibugs>	 (Merged) jenkins-bot: Release v0.0.4 [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/756026 (owner: RLazarus)
[17:38:30] <wikibugs>	 (PS2) AOkoth: gitlab: change restore manifest formatting [puppet] - https://gerrit.wikimedia.org/r/756033
[17:40:38] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab: change restore manifest formatting [puppet] - https://gerrit.wikimedia.org/r/756033 (owner: AOkoth)
[17:40:57] <wikibugs>	 (PS3) AOkoth: gitlab: change restore manifest formatting [puppet] - https://gerrit.wikimedia.org/r/756033
[17:41:09] <wikibugs>	 (PS4) AOkoth: gitlab: change restore manifest formatting [puppet] - https://gerrit.wikimedia.org/r/756033
[17:42:02] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: decom sodium - https://phabricator.wikimedia.org/T298727 (jhathaway)
[17:42:19] <rzl>	 !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/python3-imagecatalog/imagecatalog_0.0.4-1_amd64.changes
[17:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:47] <wikibugs>	 (CR) AOkoth: [C: +2] gitlab: change restore manifest formatting [puppet] - https://gerrit.wikimedia.org/r/756033 (owner: AOkoth)
[17:55:30] <wikibugs>	 (PS1) Herron: remove kibana-disc from discovery-metafo-resources [dns] - https://gerrit.wikimedia.org/r/756036 (https://phabricator.wikimedia.org/T299700)
[17:55:56] <wikibugs>	 (CR) BBlack: [C: +1] remove kibana-disc from discovery-metafo-resources [dns] - https://gerrit.wikimedia.org/r/756036 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[17:57:11] <wikibugs>	 (CR) Herron: [C: +2] remove kibana-disc from discovery-metafo-resources [dns] - https://gerrit.wikimedia.org/r/756036 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[18:01:32] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:04:45] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "This seems right but it's been years since I deployed a mw config change; hoping someone else will get it lined up." [mediawiki-config] - https://gerrit.wikimedia.org/r/752134 (owner: Majavah)
[18:09:36] <wikibugs>	 (CR) Andrew Bogott: [C: +1] LabsServices: use deployment-graphite01 [mediawiki-config] - https://gerrit.wikimedia.org/r/747494 (https://phabricator.wikimedia.org/T241285) (owner: Majavah)
[18:11:26] <wikibugs>	 (PS1) Cwhite: Revert "Use the default prometheus_mysql_exporter for matomo" [puppet] - https://gerrit.wikimedia.org/r/755998
[18:13:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (RobH) Netbox Updates:  * added network port to all ps1-[ef]-eqiad * added power ports (54 or 42 depending on model) to all ps[12]-[ef]-eqiad
[18:15:07] <wikibugs>	 (PS1) Herron: remove realserver_ips from legacy elk roles & set lvs state: service_setup [puppet] - https://gerrit.wikimedia.org/r/756038 (https://phabricator.wikimedia.org/T299700)
[18:15:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[18:15:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:03] <wikibugs>	 (CR) BBlack: [C: +1] remove realserver_ips from legacy elk roles & set lvs state: service_setup [puppet] - https://gerrit.wikimedia.org/r/756038 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[18:26:11] <wikibugs>	 (CR) Herron: [C: +2] remove realserver_ips from legacy elk roles & set lvs state: service_setup [puppet] - https://gerrit.wikimedia.org/r/756038 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[18:26:19] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:24] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:32:59] <wikibugs>	 (Abandoned) Andrew Bogott: passwords: Add ladsgroup to the cloud root [labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[18:33:14] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:33:18] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:33:20] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:36:50] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[18:36:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (RobH) Added in the IP addresses in the mgmt range that were available, and assigned them to the ps1-[ef]-eqiad with the following:   e1: 10.65.2.45/16 e2: 10.65.2.46/16 e3: 10.65.2.47/16 e4: 10.65....
[18:38:57] <wikibugs>	 (PS1) CDanis: Add a start_timestamp constraint [software/statograph] - https://gerrit.wikimedia.org/r/756041 (https://phabricator.wikimedia.org/T298619)
[18:39:42] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:24] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:18] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:45:46] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:46:42] <herron>	 !log restarting pybal on lvs1015,lvs1020,lvs2009,lvs2010 to remove legacy elk5 services T299700
[18:46:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:46] <stashbot>	 T299700: Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700
[18:49:26] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:49:38] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:51:58] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:59:02] <wikibugs>	 (CR) BBlack: [C: +1] remove elk5 related LVS services [puppet] - https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[19:01:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[19:01:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:43] <wikibugs>	 (PS4) Herron: remove elk5 related LVS services [puppet] - https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T299700)
[19:02:17] <wikibugs>	 (PS13) JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966)
[19:02:46] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:02:51] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, Patch-For-Review: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC - https://phabricator.wikimedia.org/T298619 (CDanis) Open→Resolved {F34926035}  It took just a single run of `statograph -v up...
[19:03:51] <jinxer-wm>	 (Juniper alarm active) firing: Juniper alarm active   - https://alerts.wikimedia.org
[19:05:23] <wikibugs>	 (CR) Herron: [C: +2] remove elk5 related LVS services [puppet] - https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[19:05:39] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:05:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:06] <wikibugs>	 SRE, Traffic, Patch-For-Review, SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (herron)
[19:10:27] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[19:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:36] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:10:36] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-udp2log on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-udp2log is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:10:44] <wikibugs>	 (PS1) Herron: cleanup kibana.svc records [dns] - https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700)
[19:11:04] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:08] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-udp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-json-udp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:22] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana-ssl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:24] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana-ssl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:26] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:28] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-gelf on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-gelf is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:30] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-tcp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-json-tcp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:34] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-udp on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-json-udp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:36] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:44] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-gelf on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-gelf is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:46] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana7 on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana7 is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:50] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana7 on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana7 is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:11:52] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:12:07] <wikibugs>	 (PS2) Herron: cleanup logstash and kibana svc records [dns] - https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700)
[19:12:18] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-tcp on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-json-tcp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:12:48] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-udp2log on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-udp2log is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:12:50] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:14:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:52] <wikibugs>	 (03PS1) 10Herron: remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756046 (https://phabricator.wikimedia.org/T299700)
[19:17:32] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756046 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron)
[19:18:37] <wikibugs>	 (03CR) 10Herron: [C: 03+2] remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756046 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron)
[19:20:40] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal BGP sessions are established on lvs6002 is CRITICAL: 0 le 0 Brandon Black These won't clear until the mx204s get configured in drmrs https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops
[19:20:40] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal BGP sessions are established on lvs6003 is CRITICAL: 0 le 0 Brandon Black These won't clear until the mx204s get configured in drmrs https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops
[19:22:49] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana7 on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana7 is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:27:02] <cdanis>	 herron: ^
[19:27:16] <herron>	 cdanis: thanks, troubleshooting in -traffic
[19:27:22] <cdanis>	 ah ok sorry :)
[19:27:29] <herron>	 no worries thx for the ping
[19:32:27] <wikibugs>	 (03PS1) 10Herron: Revert "remove logstash and kibana entries from conftool-data discovery services" [puppet] - 10https://gerrit.wikimedia.org/r/756000
[19:33:29] <wikibugs>	 (03CR) 10BBlack: cleanup logstash and kibana svc records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron)
[19:34:21] <wikibugs>	 (03CR) 10Herron: [C: 03+2] Revert "remove logstash and kibana entries from conftool-data discovery services" [puppet] - 10https://gerrit.wikimedia.org/r/756000 (owner: 10Herron)
[19:38:51] <jinxer-wm>	 (Juniper alarm active) resolved: Juniper alarm active   - https://alerts.wikimedia.org
[19:40:24] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[19:43:53] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana7 on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana7 is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:51:11] <icinga-wm>	 PROBLEM - puppet last run on cloudbackup2002 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:57:50] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/kibana7 on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:57:50] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kibana7 on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[19:59:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "yea, we don't have a mariadb/mysql backend anymore. and thanks for fixing the link" [puppet] - 10https://gerrit.wikimedia.org/r/755329 (owner: 10Hashar)
[19:59:47] <wikibugs>	 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Upgrade deployment-prep Swift cluster to Debian Buster or newer - https://phabricator.wikimedia.org/T298253 (10Majavah)
[20:00:32] <wikibugs>	 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Upgrade deployment-prep Swift cluster to Debian Buster or newer - https://phabricator.wikimedia.org/T298253 (10Majavah)
[20:00:58] <wikibugs>	 (03PS1) 10Herron: remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756001
[20:03:16] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756001 (owner: 10Herron)
[20:03:21] <wikibugs>	 (03CR) 10Herron: [C: 03+2] remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756001 (owner: 10Herron)
[20:05:32] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/kibana7 on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[20:07:27] <wikibugs>	 (03PS3) 10Herron: cleanup logstash and kibana svc records [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700)
[20:08:00] <wikibugs>	 (03CR) 10Herron: cleanup logstash and kibana svc records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron)
[20:09:16] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kibana7 on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[20:09:46] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] cleanup logstash and kibana svc records [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron)
[20:09:55] <wikibugs>	 (03CR) 10Herron: [C: 03+2] cleanup logstash and kibana svc records [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron)
[20:12:54] <icinga-wm>	 RECOVERY - puppet last run on cloudbackup2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:17:23] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] Revert "Use the default prometheus_mysql_exporter for matomo" [puppet] - 10https://gerrit.wikimedia.org/r/755998 (owner: 10Cwhite)
[20:21:47] <wikibugs>	 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron)
[20:21:51] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron) 05Open→03Resolved These have been removed with much help from @BBlack thank you!
[20:25:24] <wikibugs>	 (03PS4) 10Aqu: Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074)
[20:26:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074) (owner: 10Aqu)
[20:26:56] <icinga-wm>	 PROBLEM - Disk space on ml-etcd2002 is CRITICAL: DISK CRITICAL - free space: / 722 MB (3% inode=95%): /tmp 722 MB (3% inode=95%): /var/tmp 722 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ml-etcd2002&var-datasource=codfw+prometheus/ops
[20:31:22] <wikibugs>	 (03PS5) 10Aqu: Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074)
[20:47:08] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission sodium.wikimedia.org - https://phabricator.wikimedia.org/T299785 (10wiki_willy) a:03Cmjohnson
[20:49:23] <wikibugs>	 (03PS3) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585)
[20:51:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (10wiki_willy) a:03Cmjohnson Assigning this to @Cmjohnson.  However, I also reached out to @MoritzMuehlenhoff to take a peek at this and T297814 later ne...
[20:53:01] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10wiki_willy) a:03Cmjohnson
[20:55:12] <wikibugs>	 (03PS4) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585)
[20:55:58] <wikibugs>	 (03PS1) 10Cwhite: elasticsearch: write curator logs to stdout [puppet] - 10https://gerrit.wikimedia.org/r/756053 (https://phabricator.wikimedia.org/T297239)
[21:00:27] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: install logstash-plugins on logging logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/755811 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite)
[21:01:54] <wikibugs>	 (03CR) 10Herron: [C: 03+1] elasticsearch: write curator logs to stdout [puppet] - 10https://gerrit.wikimedia.org/r/756053 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite)
[21:02:23] <MatmaRex>	 hi, sorry to ruin your mood, https://phabricator.wikimedia.org/T299767 might warrant a train rollback
[21:02:49] <dancy>	 Gah!
[21:03:39] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: switch to opensearch output plugin on production logstash [puppet] - 10https://gerrit.wikimedia.org/r/755812 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite)
[21:03:45] <MatmaRex>	 i don't know what causes the issue yet
[21:03:48] <rzl>	 if a Friday push is needed I can be around for SRE
[21:03:56] <MatmaRex>	 and whether train rollback will fix it. but hopefully
[21:03:59] <rzl>	 (for the next ~5 hours)
[21:05:58] <brennen>	 i'm here as well.  i think j.eena may be out today.
[21:09:14] <MatmaRex>	 okay, i know what broke it, we just need a revert
[21:10:45] <MatmaRex>	 rzl: brennen: reverting in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/756005 , are you able to backport and deploy?
[21:11:46] <brennen>	 yeah, i'm able.  also cc: twentyafterfour as backup train deployer in case around.
[21:12:15] <wikibugs>	 (03PS1) 10Cathal Mooney: Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - 10https://gerrit.wikimedia.org/r/756057
[21:14:06] <brennen>	 MatmaRex: cherry picking.  how confident are you as far as the revert needing review?
[21:14:38] <MatmaRex>	 brennen: i tested by monkey-patching the code in browser console, it fixes the issue for me
[21:15:24] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - 10https://gerrit.wikimedia.org/r/756057 (owner: 10Cathal Mooney)
[21:15:27] <MatmaRex>	 brennen: it also has a +2 now
[21:15:31] <wikibugs>	 (03PS2) 10Cathal Mooney: Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - 10https://gerrit.wikimedia.org/r/756057
[21:15:57] <wikibugs>	 (03PS5) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download and rsync [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585)
[21:16:04] <wikibugs>	 (03PS1) 10Brennen Bearnes: Revert "Re-duplicate deduplicated TemplateStyles" [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/756066 (https://phabricator.wikimedia.org/T287675)
[21:17:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add systemd timer for Enterprise HTML dumps download and rsync [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn)
[21:17:09] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - 10https://gerrit.wikimedia.org/r/756057 (owner: 10Cathal Mooney)
[21:18:11] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] "Tested and reviewed on master, going ahead with backport." [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/756066 (https://phabricator.wikimedia.org/T287675) (owner: 10Brennen Bearnes)
[21:18:35] <brennen>	 MatmaRex: cool - thanks and going ahead.
[21:18:35] <wikibugs>	 (03Merged) 10jenkins-bot: Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - 10https://gerrit.wikimedia.org/r/756057 (owner: 10Cathal Mooney)
[21:19:02] <MatmaRex>	 thanks brennen
[21:20:55] * brennen waits on CI, pulls up error logs in meanwhile.
[21:21:04] <wikibugs>	 (03PS6) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download and rsync [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585)
[21:21:50] <topranks>	 !log Running homer against cr1-eqiad and cr2-eqiad to remove entries on analytics-in4/6 filters that refer to decommissioned deb mirror host sodium.
[21:21:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add systemd timer for Enterprise HTML dumps download and rsync [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn)
[21:21:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:56] <wikibugs>	 (03PS7) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download and rsync [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585)
[21:34:03] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Re-duplicate deduplicated TemplateStyles" [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/756066 (https://phabricator.wikimedia.org/T287675) (owner: 10Brennen Bearnes)
[21:36:28] <brennen>	 MatmaRex: patch on mwdebug1002 if you want to test; ready to sync.
[21:36:49] <MatmaRex>	 brennen: thanks. yeah, i can
[21:37:24] <MatmaRex>	 brennen: looks fixed
[21:37:38] <brennen>	 cool, syncing
[21:38:56] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/VisualEditor/modules/ve-mw: Backport: [[gerrit:756066|Revert "Re-duplicate deduplicated TemplateStyles" (T287675 T299251 T299767)]] (duration: 00m 49s)
[21:39:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:03] <stashbot>	 T287675: De-duplicated TemplateStyles missing when editing a section in visual section editing - https://phabricator.wikimedia.org/T287675
[21:39:03] <stashbot>	 T299767: Triggering Infobox duplication: Adds a large block of source text - https://phabricator.wikimedia.org/T299767
[21:39:03] <stashbot>	 T299251: Visual diffs sometimes missing TemplateStyles - https://phabricator.wikimedia.org/T299251
[21:39:59] <MatmaRex>	 thanks brennen
[21:40:30] <MatmaRex>	 hope the rest of your weekend is better than this :D
[21:40:42] <rzl>	 ^ agreed on both counts!
[21:40:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[21:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:32] <brennen>	 MatmaRex: same to you. :)
[21:42:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[21:42:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[21:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[21:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:11] <wikibugs>	 (03PS1) 10Cathal Mooney: Modify labs-in[4|6] filters in eqiad to allow traffic to codfw backups [homer/public] - 10https://gerrit.wikimedia.org/r/756060
[21:50:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "thank you!" [homer/public] - 10https://gerrit.wikimedia.org/r/756060 (owner: 10Cathal Mooney)
[21:54:46] <wikibugs>	 (03PS2) 10Cathal Mooney: Modify labs-in[4|6] filters in eqiad to allow traffic to codfw backups [homer/public] - 10https://gerrit.wikimedia.org/r/756060
[21:55:35] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Modify labs-in[4|6] filters in eqiad to allow traffic to codfw backups [homer/public] - 10https://gerrit.wikimedia.org/r/756060 (owner: 10Cathal Mooney)
[21:55:55] <jinxer-wm>	 (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash - https://alerts.wikimedia.org
[21:56:13] <wikibugs>	 (03Merged) 10jenkins-bot: Modify labs-in[4|6] filters in eqiad to allow traffic to codfw backups [homer/public] - 10https://gerrit.wikimedia.org/r/756060 (owner: 10Cathal Mooney)
[21:59:43] * cwhite looking into logstash
[22:06:13] <wikibugs>	 (03PS1) 10Accraze: ml-services: add draftquality transformer [deployment-charts] - 10https://gerrit.wikimedia.org/r/756064 (https://phabricator.wikimedia.org/T298989)
[22:06:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_logstash site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:08:46] <icinga-wm>	 PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: logstash.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:08:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:10:10] <wikibugs>	 (03PS1) 10Cwhite: Revert "logstash: install logstash-plugins on logging logstash clusters" [puppet] - 10https://gerrit.wikimedia.org/r/756067
[22:10:32] <wikibugs>	 (03CR) 10Cwhite: [V: 03+2 C: 03+2] Revert "logstash: install logstash-plugins on logging logstash clusters" [puppet] - 10https://gerrit.wikimedia.org/r/756067 (owner: 10Cwhite)
[22:13:32] <icinga-wm>	 RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:20:55] <jinxer-wm>	 (LogstashNoLogsIndexed) resolved: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash - https://alerts.wikimedia.org
[22:21:40] <wikibugs>	 (03PS8) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download and rsync [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585)
[22:21:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org
[22:23:55] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mx1001.wikimedia.org with reason: kernel testing
[22:23:56] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mx1001.wikimedia.org with reason: kernel testing
[22:23:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org
[22:29:05] <wikibugs>	 (03PS1) 10Cathal Mooney: Add TCP port 6812 to ports allowed from cloudbackup to cloudceph eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/756089
[22:30:44] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] Add TCP port 6812 to ports allowed from cloudbackup to cloudceph eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/756089 (owner: 10Cathal Mooney)
[22:31:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] Add TCP port 6812 to ports allowed from cloudbackup to cloudceph eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/756089 (owner: 10Cathal Mooney)
[22:31:18] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add TCP port 6812 to ports allowed from cloudbackup to cloudceph eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/756089 (owner: 10Cathal Mooney)
[22:31:54] <wikibugs>	 (03Merged) 10jenkins-bot: Add TCP port 6812 to ports allowed from cloudbackup to cloudceph eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/756089 (owner: 10Cathal Mooney)
[22:31:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org
[22:39:18] <icinga-wm>	 PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:43:08] <wikibugs>	 (03PS1) 10Cwhite: Revert "Revert "logstash: install logstash-plugins on logging logstash clusters"" [puppet] - 10https://gerrit.wikimedia.org/r/756068
[22:45:40] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] Revert "Revert "logstash: install logstash-plugins on logging logstash clusters"" [puppet] - 10https://gerrit.wikimedia.org/r/756068 (owner: 10Cwhite)