[00:00:05] brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220121T0000).
[00:00:29] o/
[00:00:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:48] no one for training today and no patches in the queue; calling it.
[00:02:59] +1
[00:04:34] (CR) Scardenasmolinar: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628) (owner: Eigyan)
[00:26:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_network_flows_internal-sanitization_daily.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:16] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[00:33:38] (CR) Cwhite: [C: +1] switch legacy elk LVS entries to state: lvs_setup [puppet] - https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[00:33:48] (CR) Cwhite: [C: +1] remove kibana.discovery.wmnet record [dns] - https://gerrit.wikimedia.org/r/755790 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[00:34:59] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[00:35:27] (CR) Cwhite: [C: +1] hieradata: add host-specific Prometheus data [puppet] - https://gerrit.wikimedia.org/r/755711 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[00:36:29] (CR) Cwhite: [C: +1] Add prometheus[12]00[56] to prometheus_nodes [puppet] - 
https://gerrit.wikimedia.org/r/755708 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[01:35:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:40:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:44:41] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[02:13:19] (PS1) Scardenasmolinar: Lower The Wikipedia Library editcount [mediawiki-config] - https://gerrit.wikimedia.org/r/755834 (https://phabricator.wikimedia.org/T288070)
[03:33:17] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:17] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 58.04 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:15:37] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 87.1 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:38:20] (PS1) Marostegui: Revert "es1022: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755769
[05:41:56] (CR) Marostegui: [C: +2] Revert "es1022: Disable notifications" 
[puppet] - https://gerrit.wikimedia.org/r/755769 (owner: Marostegui)
[05:42:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18950 and previous config saved to /var/cache/conftool/dbconfig/20220121-054228-root.json
[05:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:47] (PS1) Marostegui: es2030,es2032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755850 (https://phabricator.wikimedia.org/T299741)
[05:48:40] (CR) Marostegui: [C: +2] es2030,es2032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755850 (https://phabricator.wikimedia.org/T299741) (owner: Marostegui)
[05:49:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2030.codfw.wmnet with OS bullseye
[05:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18951 and previous config saved to /var/cache/conftool/dbconfig/20220121-055732-root.json
[05:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18952 and previous config saved to /var/cache/conftool/dbconfig/20220121-061235-root.json
[06:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:46] (PS1) Marostegui: Revert "es2030,es2032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755770
[06:19:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2030.codfw.wmnet with OS bullseye
[06:19:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:45] (CR) Marostegui: [C: +2] Revert "es2030,es2032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755770 (owner: Marostegui)
[06:21:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2032 to es1 master T299741', diff saved to https://phabricator.wikimedia.org/P18953 and previous config saved to /var/cache/conftool/dbconfig/20220121-062116-marostegui.json
[06:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:20] T299741: Upgrade es1 to Bullseye - https://phabricator.wikimedia.org/T299741
[06:22:54] (PS1) Marostegui: es2028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755853 (https://phabricator.wikimedia.org/T299741)
[06:23:46] (CR) Marostegui: [C: +2] es2028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755853 (https://phabricator.wikimedia.org/T299741) (owner: Marostegui)
[06:24:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bullseye
[06:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 20%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18954 and previous config saved to /var/cache/conftool/dbconfig/20220121-062739-root.json
[06:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18955 and previous config saved to /var/cache/conftool/dbconfig/20220121-064243-root.json
[06:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
es2028.codfw.wmnet with OS bullseye
[06:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:36] (PS1) Marostegui: Revert "es2028: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755771
[06:56:26] (CR) Marostegui: [C: +2] Revert "es2028: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755771 (owner: Marostegui)
[06:57:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 40%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18956 and previous config saved to /var/cache/conftool/dbconfig/20220121-065746-root.json
[06:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1032 for reimage T299741', diff saved to https://phabricator.wikimedia.org/P18957 and previous config saved to /var/cache/conftool/dbconfig/20220121-065854-marostegui.json
[06:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:58] T299741: Upgrade es1 to Bullseye - https://phabricator.wikimedia.org/T299741
[07:00:02] (PS1) Marostegui: es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755856 (https://phabricator.wikimedia.org/T299741)
[07:01:00] (CR) Marostegui: [C: +2] es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755856 (https://phabricator.wikimedia.org/T299741) (owner: Marostegui)
[07:04:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1032.eqiad.wmnet with OS bullseye
[07:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18958 and previous config saved to /var/cache/conftool/dbconfig/20220121-071250-root.json
[07:12:53] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:57] !log systemctl reset-failed session-3.scope on an-test-client1001 (failed, transient unit)
[07:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:14] !log elukey@build2001:~$ sudo systemctl reset-failed ifup@ens13.service
[07:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:28] !log elukey@stat1007:~$ sudo systemctl reset-failed product-analytics-movement-metrics.service
[07:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 60%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18959 and previous config saved to /var/cache/conftool/dbconfig/20220121-072754-root.json
[07:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:12] (PS1) Marostegui: Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755772
[07:30:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1032.eqiad.wmnet with OS bullseye
[07:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18960 and previous config saved to /var/cache/conftool/dbconfig/20220121-073051-root.json
[07:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:58] (CR) Marostegui: [C: +2] Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755772 (owner: Marostegui)
[07:32:33] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:01] PROBLEM - 
k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:35:04] (PS1) Marostegui: core_multiinstance.my.cnf.erb: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/755915 (https://phabricator.wikimedia.org/T287244)
[07:36:37] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:36:58] (CR) Marostegui: [C: +2] core_multiinstance.my.cnf.erb: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/755915 (https://phabricator.wikimedia.org/T287244) (owner: Marostegui)
[07:42:09] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18961 and previous config saved to /var/cache/conftool/dbconfig/20220121-074257-root.json
[07:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18962 and previous config saved to /var/cache/conftool/dbconfig/20220121-074555-root.json
[07:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:21] (PS1) Marostegui: Revert "db2087: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/755773
[07:52:27] (CR) Marostegui: [C: +2] Revert "db2087: Disable notifications" [puppet] - 
https://gerrit.wikimedia.org/r/755773 (owner: Marostegui)
[07:58:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: repooling after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P18963 and previous config saved to /var/cache/conftool/dbconfig/20220121-075801-root.json
[07:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220121T0800)
[08:00:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18964 and previous config saved to /var/cache/conftool/dbconfig/20220121-080058-root.json
[08:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:16:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18965 and previous config saved to /var/cache/conftool/dbconfig/20220121-081602-root.json
[08:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:18] PROBLEM - carbon-cache@h service on cloudmetrics1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:17:24] PROBLEM - carbon-cache@c service on cloudmetrics1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:17:38] PROBLEM - carbon-local-relay service on 
cloudmetrics1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:19:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:19:24] RECOVERY - carbon-cache@h service on cloudmetrics1004 is OK: OK - carbon-cache@h is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:19:30] RECOVERY - carbon-cache@c service on cloudmetrics1004 is OK: OK - carbon-cache@c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:19:44] RECOVERY - carbon-local-relay service on cloudmetrics1004 is OK: OK - carbon-local-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:27:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet
[08:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18966 and previous config saved to /var/cache/conftool/dbconfig/20220121-083106-root.json
[08:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet
[08:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1018.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[08:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1018.eqiad.wmnet to 
ganeti01.svc.eqiad.wmnet
[08:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:42:13] (PS1) Vgutierrez: site: Reimage cp3063 as cache::upload_envoy [puppet] - https://gerrit.wikimedia.org/r/755917 (https://phabricator.wikimedia.org/T271421)
[08:42:17] (PS1) Filippo Giunchedi: prometheus: bump open files limit for blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/755918 (https://phabricator.wikimedia.org/T296199)
[08:44:30] (PS2) Filippo Giunchedi: prometheus: bump open files limit for blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/755918 (https://phabricator.wikimedia.org/T296199)
[08:46:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18967 and previous config saved to /var/cache/conftool/dbconfig/20220121-084609-root.json
[08:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:21] (CR) Filippo Giunchedi: [C: +2] prometheus: bump open files limit for blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/755918 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[08:48:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:27] (CR) Filippo Giunchedi: [C: +1] logstash: switch to opensearch output plugin on production logstash [puppet] - https://gerrit.wikimedia.org/r/755812 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite)
[08:53:20] (CR) Filippo Giunchedi: [C: +1] logstash: install logstash-plugins on logging logstash clusters [puppet] - https://gerrit.wikimedia.org/r/755811 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite)
[08:53:40] (CR) Filippo Giunchedi: [C: +1] switch legacy elk LVS entries to state: lvs_setup [puppet] - https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[08:53:48] (CR) Filippo Giunchedi: [C: +1] remove kibana.discovery.wmnet record [dns] - https://gerrit.wikimedia.org/r/755790 (https://phabricator.wikimedia.org/T299700) (owner: Herron)
[08:54:00] (PS2) ArielGlenn: update wme html dumps downloader to use JWT auth tokens [puppet] - https://gerrit.wikimedia.org/r/755345 (https://phabricator.wikimedia.org/T273585)
[08:54:40] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:55:43] (PS1) JMeybohm: Remove the hacks to around masquerade-all [deployment-charts] - https://gerrit.wikimedia.org/r/755920 (https://phabricator.wikimedia.org/T290967)
[08:56:04] (PS2) JMeybohm: Remove the hacks around masquerade-all [deployment-charts] - https://gerrit.wikimedia.org/r/755920 (https://phabricator.wikimedia.org/T290967)
[08:56:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:57:56] RECOVERY - 
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:58:50] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:00:35] !log depool cp3063 to be reimaged as cache::upload_envoy - T271421
[09:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:39] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[09:01:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18968 and previous config saved to /var/cache/conftool/dbconfig/20220121-090113-root.json
[09:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:17] (CR) Vgutierrez: [C: +2] site: Reimage cp3063 as cache::upload_envoy [puppet] - https://gerrit.wikimedia.org/r/755917 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:03:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:04:18] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3063.esams.wmnet with OS buster
[09:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:30] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster 
[09:04:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:05:17] (PS1) Filippo Giunchedi: prometheus: override valid status codes for http probes [puppet] - https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T296199)
[09:06:28] (CR) JMeybohm: [C: +2] "Just FYI - this will produce a diff on ml as well as the staging-codfw node IP's where hardcoded" [deployment-charts] - https://gerrit.wikimedia.org/r/755920 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:06:33] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:06:35] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[09:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:50] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:10:30] (Merged) jenkins-bot: Remove the hacks around masquerade-all [deployment-charts] - https://gerrit.wikimedia.org/r/755920 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:11:04] RECOVERY - Disk space on deneb is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[09:11:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:11:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:38] ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (aborrero)
[09:13:15] ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (aborrero)
[09:13:21] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (aborrero)
[09:13:40] SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (aborrero)
[09:13:45] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (aborrero)
[09:16:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18969 and previous config saved to /var/cache/conftool/dbconfig/20220121-091617-root.json
[09:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:34] (PS1) JMeybohm: Add master IPs to main/wikikube clusters [deployment-charts] - https://gerrit.wikimedia.org/r/755924 (https://phabricator.wikimedia.org/T290967)
[09:19:18] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[09:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:37] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:28] (PS3) ArielGlenn: update wme html dumps downloader to use JWT auth tokens [puppet] - https://gerrit.wikimedia.org/r/755345 (https://phabricator.wikimedia.org/T273585)
[09:26:12] (CR) ArielGlenn: [C: +2] update wme html dumps downloader to use JWT auth tokens [puppet] - https://gerrit.wikimedia.org/r/755345 (https://phabricator.wikimedia.org/T273585) (owner: ArielGlenn)
[09:30:02] (CR) Jelto: [C: +1] "lgtm, double-checked control plane IPs." [deployment-charts] - https://gerrit.wikimedia.org/r/755924 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:31:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18970 and previous config saved to /var/cache/conftool/dbconfig/20220121-093120-root.json
[09:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:25] (CR) JMeybohm: [C: +2] Add master IPs to main/wikikube clusters [deployment-charts] - https://gerrit.wikimedia.org/r/755924 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:32:29] (PS2) Filippo Giunchedi: prometheus: override valid status codes for http probes [puppet] - https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T291946)
[09:32:31] (PS1) Filippo Giunchedi: prometheus: support probes behind SSO [puppet] - https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946)
[09:32:42] (Abandoned) Elukey: role::pki::root: add the ml_serve intermediate PKI [puppet] - https://gerrit.wikimedia.org/r/755651 (https://phabricator.wikimedia.org/T298976) (owner: Elukey)
[09:33:19] (CR) 
jerkins-bot: [V: -1] prometheus: override valid status codes for http probes [puppet] - https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:33:31] (CR) jerkins-bot: [V: -1] prometheus: support probes behind SSO [puppet] - https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:35:23] (Merged) jenkins-bot: Add master IPs to main/wikikube clusters [deployment-charts] - https://gerrit.wikimedia.org/r/755924 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:37:31] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33373/console" [puppet] - https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:39:49] (PS3) Filippo Giunchedi: prometheus: override valid status codes for http probes [puppet] - https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T291946)
[09:39:51] (PS2) Filippo Giunchedi: prometheus: support probes behind SSO [puppet] - https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946)
[09:40:41] !log vgutierrez@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3063.esams.wmnet with OS buster
[09:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:49] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster executed with errors: - cp30... 
[09:41:35] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3063.esams.wmnet with OS buster
[09:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:44] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster
[09:45:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:46:47] (CR) Filippo Giunchedi: [C: +2] prometheus: override valid status codes for http probes [puppet] - https://gerrit.wikimedia.org/r/755922 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:47:02] (CR) Filippo Giunchedi: [C: +2] prometheus: support probes behind SSO [puppet] - https://gerrit.wikimedia.org/r/755927 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:49:45] (PS1) Marostegui: misc.my.cnf.erb: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/755932 (https://phabricator.wikimedia.org/T287244)
[09:50:35] (CR) Marostegui: [C: +2] misc.my.cnf.erb: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/755932 (https://phabricator.wikimedia.org/T287244) (owner: Marostegui)
[09:50:55] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1004.eqiad.wmnet with reason: Switch back to plain disks
[09:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1004.eqiad.wmnet with reason: Switch back to plain disks
[09:51:00] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:38] !log switch kubetcd1004 back to plain disks [09:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:01] 10SRE, 10SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (10Jelto) p:05Triage→03Medium [09:52:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:55:41] 10SRE, 10SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (10Jelto) thanks for the request. To proceed with this request approval is needed from @odimitrijevic and your manager @WDoranWMF . [09:55:58] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:58:04] (03PS1) 10Ayounsi: Add scs-f8-eqiad to Icinga and Rancid [puppet] - 10https://gerrit.wikimedia.org/r/755933 (https://phabricator.wikimedia.org/T298980) [09:58:47] (03PS1) 10Marostegui: Revert "x2 hosts: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/755780 [09:59:27] (03CR) 10Marostegui: [C: 03+2] Revert "x2 hosts: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/755780 (owner: 10Marostegui) [10:07:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: Switch back to plain disks [10:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: Switch back to plain disks [10:07:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:53] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2023.codfw.wmnet with OS buster [10:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:15] !log switch kubetcd1005 back to plain disks [10:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:27] (03CR) 10Ayounsi: [C: 03+2] Add scs-f8-eqiad to Icinga and Rancid [puppet] - 10https://gerrit.wikimedia.org/r/755933 (https://phabricator.wikimedia.org/T298980) (owner: 10Ayounsi) [10:14:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1006.eqiad.wmnet with reason: Switch back to plain disks [10:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1006.eqiad.wmnet with reason: Switch back to plain disks [10:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:47] !log switch kubetcd1006 back to plain disks [10:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:18] (03CR) 10Hashar: gerrit: use default for index.batchThreads (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755329 (owner: 10Hashar) [10:22:28] (03PS2) 10Hashar: gerrit: use default for index.batchThreads [puppet] - 10https://gerrit.wikimedia.org/r/755329 [10:23:33] (03CR) 10Hashar: "Now that we run Gerrit 3.x and no more have a database backend, I am making the index batch threads to use the default value (based on num" [puppet] - 10https://gerrit.wikimedia.org/r/407857 (owner: 10Dzahn) [10:26:24] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [10:27:43] (03PS2) 10Muehlenhoff: Make ganeti1025 a Ganeti node [puppet] - 
10https://gerrit.wikimedia.org/r/755723 [10:33:44] !log migrate primary/secondary instances off ganeti1013 [10:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:09] (03PS1) 10Kormat: Prepare for 0.8.1 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/755947 (https://phabricator.wikimedia.org/T297605) [10:47:28] (03CR) 10Kormat: [C: 03+2] Prepare for 0.8.1 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/755947 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [10:47:38] (03PS1) 10Hashar: ci: always use a LVM volume for Docker data [puppet] - 10https://gerrit.wikimedia.org/r/755948 [10:50:07] (03Merged) 10jenkins-bot: Prepare for 0.8.1 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/755947 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [10:53:16] (03CR) 10Hashar: [C: 03+1] "I have cherry picked it on the integration puppet master and it is working as expected: it is a noop." [puppet] - 10https://gerrit.wikimedia.org/r/755948 (owner: 10Hashar) [10:55:43] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1025 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755723 (owner: 10Muehlenhoff) [10:58:46] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3063.esams.wmnet with OS buster [10:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:55] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster completed: - cp3063 (**WARN*... 
[11:14:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2023.codfw.wmnet with OS buster [11:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:00] (03Abandoned) 10Alexandros Kosiaris: Rename main cluster to wikikube (1/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [11:15:29] !log pool cp3063 running envoy as TLS termination layer - T271421 [11:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:32] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [11:16:02] (03PS1) 10Elukey: kserve: move to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/755955 (https://phabricator.wikimedia.org/T298976) [11:17:48] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2023.codfw.wmnet [11:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:13] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2024.codfw.wmnet with OS buster [11:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:56] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1016.eqiad.wmnet with OS buster [11:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:01] (03PS1) 10Elukey: envoy-future: add the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550) [11:24:31] (03PS1) 10Hnowlan: postgres: add option to enable replication slots [puppet] - 10https://gerrit.wikimedia.org/r/755959 (https://phabricator.wikimedia.org/T290149) [11:25:43] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550) (owner: 10Elukey) [11:26:40] (03CR) 10Elukey: [C: 03+2] kserve: move to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/755955 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [11:30:26] (03PS2) 10Hnowlan: postgres: add option to enable replication slots [puppet] - 10https://gerrit.wikimedia.org/r/755959 (https://phabricator.wikimedia.org/T290149) [11:31:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:57] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:35] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33376/console" [puppet] - 10https://gerrit.wikimedia.org/r/755959 (https://phabricator.wikimedia.org/T290149) (owner: 10Hnowlan) [11:34:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:18] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [11:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:27] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[11:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:57] (03PS1) 10Jcrespo: dbbackups: Manually switchover primary stats db db1159 -> db1128 [puppet] - 10https://gerrit.wikimedia.org/r/755960 (https://phabricator.wikimedia.org/T299624) [11:43:29] (03CR) 10Jcrespo: "See context at: https://phabricator.wikimedia.org/T299624#7639955" [puppet] - 10https://gerrit.wikimedia.org/r/755960 (https://phabricator.wikimedia.org/T299624) (owner: 10Jcrespo) [11:49:53] (03CR) 10Marostegui: [C: 03+1] dbbackups: Manually switchover primary stats db db1159 -> db1128 [puppet] - 10https://gerrit.wikimedia.org/r/755960 (https://phabricator.wikimedia.org/T299624) (owner: 10Jcrespo) [11:56:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [11:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:57] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:01:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [12:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1025.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [12:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:13] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1016.eqiad.wmnet with OS buster [12:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:16] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2024.codfw.wmnet with OS buster [12:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:26] (03PS1) 10Hashar: gerrit: set 
sshd.enableChannelIdTracking=false [puppet] - 10https://gerrit.wikimedia.org/r/755968 (https://phabricator.wikimedia.org/T263293) [12:25:03] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney) [12:25:50] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2025.codfw.wmnet with OS buster [12:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:54] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1017.eqiad.wmnet with OS buster [12:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2024.codfw.wmnet [12:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:33] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1016.eqiad.wmnet [12:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:52] PROBLEM - ganeti-confd running on ganeti1025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:29:40] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney) [12:31:16] PROBLEM - ganeti-mond running on ganeti1025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [12:32:52] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Wikifeeds [12:34:19] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) >>! In T299527#7637094, @Cmjohnson wrote: > @MoritzMuehlenhoff The idrac is giving me a hard time, it's not worth slowing this pr... [12:35:40] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:39:11] (03PS1) 10Elukey: knative-serving: move egress gateway to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/755970 (https://phabricator.wikimedia.org/T298976) [12:48:05] (03PS1) 10Btullis: Use the default prometheus_mysql_exporter for matomo [puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762) [12:49:06] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33377/console" [puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis) [12:50:10] 10ops-eqiad, 10DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (10Aklapper) Adding #ops-eqiad (feel free to correct) so this ticket can be found. [12:51:07] (03CR) 10JMeybohm: [C: 03+1] envoy-future: add the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550) (owner: 10Elukey) [12:52:20] (03CR) 10Btullis: [V: 03+1] "I will need to remove traces of the previous prometheus-mysqld-exporter@matomo.service manually, once this has been deployed."
[puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis) [12:53:01] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:45] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:59:38] 10ops-eqiad, 10DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (10cmooney) [13:00:09] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:00:47] 10ops-eqiad, 10DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (10cmooney) [13:00:50] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney) [13:01:14] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [13:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:23] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [13:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:59] 10SRE, 10Infrastructure-Foundations, 10netops: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney) Currently waiting on T299759 to be completed to gain 
console access to these devices and begin the process. [13:03:54] 10ops-eqiad, 10DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (10ayounsi) [13:05:03] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2025.codfw.wmnet [13:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:10] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1017.eqiad.wmnet [13:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:22] 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (10jcrespo) [13:09:21] 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (10jcrespo) p:05Triage→03High [13:09:31] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1017.eqiad.wmnet with OS buster [13:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:25] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:13:03] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2025.codfw.wmnet with OS buster [13:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:17] (03PS4) 10ArielGlenn: [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) [13:15:41] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/755834 (https://phabricator.wikimedia.org/T288070) (owner: 10Scardenasmolinar) [13:15:49] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:15:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [13:17:58] (03PS1) 10Muehlenhoff: Make ganeti1026 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755975 [13:19:02] (03PS5) 10ArielGlenn: [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) [13:24:11] (03CR) 10MVernon: [C: 03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis) [13:34:10] (03PS6) 10ArielGlenn: [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) [13:42:30] (03CR) 10Joal: [C: 03+1] "LGTM as well - Only question I have is should we add other (high entropy) CH-UA header values, or not now." 
[puppet] - 10https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: 10Phuedx) [13:44:39] (03PS1) 10JMeybohm: Upgrade staging-eqiad kubernetes master to a full node [puppet] - 10https://gerrit.wikimedia.org/r/755977 (https://phabricator.wikimedia.org/T290967) [13:47:44] (03CR) 10ArielGlenn: [C: 03+2] [WIP] add enterprise html dumps downloader settings and credentials files [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [13:48:22] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 9 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33380/console" [puppet] - 10https://gerrit.wikimedia.org/r/755977 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:49:54] (03PS1) 10JMeybohm: Add kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967) [13:50:32] (03CR) 10jerkins-bot: [V: 04-1] Add kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:51:36] (03CR) 10ArielGlenn: "Gah forgot to remove the WIP from the commit message after staring at the diff and the pcc output for too long. 
:-(" [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [13:52:22] (03PS2) 10JMeybohm: Add kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967) [13:58:24] (03PS11) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [14:00:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] Deploy Flores MT [deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584) (owner: 10KartikMistry) [14:02:34] (03PS12) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [14:07:52] (03PS1) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) [14:16:31] (03CR) 10Btullis: [V: 03+1 C: 03+2] Use the default prometheus_mysql_exporter for matomo [puppet] - 10https://gerrit.wikimedia.org/r/755971 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis) [14:21:33] (03PS5) 10KartikMistry: Deploy Flores MT [deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584) [14:35:07] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [14:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1018.eqiad.wmnet with OS buster [14:35:14] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 07s) [14:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:16] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:23] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS buster [14:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:14] (03PS1) 10Filippo Giunchedi: puppetdb-api: allow prometheus_nodes via ferm [puppet] - 10https://gerrit.wikimedia.org/r/755982 (https://phabricator.wikimedia.org/T291946) [14:37:15] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33382/console" [puppet] - 10https://gerrit.wikimedia.org/r/755982 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:37:34] (03PS1) 10Joal: Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983 [14:38:03] (03CR) 10Elukey: [C: 03+2] knative-serving: move egress gateway to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/755970 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [14:38:09] (03CR) 10jerkins-bot: [V: 04-1] Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983 (owner: 10Joal) [14:39:44] (03PS2) 10Joal: Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983 [14:40:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:37] (03CR) 10jerkins-bot: [V: 04-1] Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983 (owner: 10Joal) [14:45:44] (03PS1) 10Elukey: Move ml-services to the new CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/755984 (https://phabricator.wikimedia.org/T298976) [14:48:37] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [14:48:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:49:35] (03CR) 10Elukey: [C: 03+2] Move ml-services to the new CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/755984 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [14:50:52] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Make target for validate_state configurable [cookbooks] - 10https://gerrit.wikimedia.org/r/756006 [14:52:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[14:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:33] (03PS3) 10Joal: Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983 (https://phabricator.wikimedia.org/T263277) [14:53:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:57:50] (03CR) 10Ottomata: [C: 03+2] Update network_flows_internal druid indexation job [puppet] - 10https://gerrit.wikimedia.org/r/755983 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [15:07:41] !log removing kibana.discovery.wmnet record and switching legacy elk LVS instances to state: lvs_setup T299700 [15:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:46] T299700: Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 [15:07:54] (03CR) 10Hnowlan: [C: 03+1] envoy-future: add the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550) (owner: 10Elukey) [15:08:11] (03PS2) 10Herron: switch legacy elk LVS entries to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700) [15:09:20] (03CR) 10Herron: [C: 03+2] remove kibana.discovery.wmnet record [dns] - 10https://gerrit.wikimedia.org/r/755790 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [15:10:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] ci: set Docker partition size explicitly [puppet] - 10https://gerrit.wikimedia.org/r/755713 (https://phabricator.wikimedia.org/T292729) (owner: 10Hashar) [15:10:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] ci: always use a LVM volume for Docker data [puppet] - 10https://gerrit.wikimedia.org/r/755948 (owner: 10Hashar) [15:10:30] 
(03PS2) 10Alexandros Kosiaris: ci: always use a LVM volume for Docker data [puppet] - 10https://gerrit.wikimedia.org/r/755948 (owner: 10Hashar) [15:10:40] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] ci: always use a LVM volume for Docker data [puppet] - 10https://gerrit.wikimedia.org/r/755948 (owner: 10Hashar) [15:10:43] (03CR) 10Herron: [C: 03+2] switch legacy elk LVS entries to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/755789 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [15:15:42] (03CR) 10Elukey: [V: 03+2 C: 03+2] envoy-future: add the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/755957 (https://phabricator.wikimedia.org/T299550) (owner: 10Elukey) [15:17:14] (03CR) 10Klausman: [C: 03+1] kserve: move to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/755955 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [15:22:29] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1018.eqiad.wmnet with OS buster [15:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:01] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1018.eqiad.wmnet [15:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:23] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1019.eqiad.wmnet with OS buster [15:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2026.codfw.wmnet with OS buster [15:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:13] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on mx1001.wikimedia.org with reason: kernel testing [15:29:15] !log jhathaway@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 8:00:00 on mx1001.wikimedia.org with reason: kernel testing [15:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:09] RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:55] RECOVERY - ganeti-mond running on ganeti1025 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [15:35:57] RECOVERY - ganeti-confd running on ganeti1025 is OK: PROCS OK: 1 process with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [15:37:26] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall LGTM - a couple questions along the way and one usability suggestion." [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:42:58] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron) [15:46:22] (03PS1) 10Matthias Mullie: Stop capturing media change tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756014 (https://phabricator.wikimedia.org/T286362) [15:50:15] !log added ganeti1025 to Ganeti eqiad cluster T293909 [15:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:19] T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 [15:50:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1025.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [15:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on 
ganeti1018.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [15:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1018.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [15:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1013.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [15:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1013.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [15:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:43] (03PS8) 10Giuseppe Lavagetto: Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 [15:51:45] (03PS5) 10Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718 [15:51:47] (03PS1) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [15:52:58] (03CR) 10Majavah: Simplify management of the request time limit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718 (owner: 10Giuseppe Lavagetto) [15:54:26] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [15:54:40] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1013 [16:02:07] 
!log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2000 days, 0:00:00 on sodium.wikimedia.org with reason: decom [16:02:09] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2000 days, 0:00:00 on sodium.wikimedia.org with reason: decom [16:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:54] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1019.eqiad.wmnet with OS buster [16:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:33] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1019.eqiad.wmnet [16:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:02] !log jhathaway@cumin1001 START - Cookbook sre.hosts.decommission for hosts sodium.wikimedia.org [16:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:51] (03CR) 10Ladsgroup: [C: 03+1] Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [16:09:34] (03PS1) 10Aqu: Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074) [16:11:29] (03PS2) 10Aqu: Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074) [16:15:30] (03PS3) 10Aqu: Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074) [16:16:37] (03CR) 10jerkins-bot: [V: 04-1] Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074) (owner: 10Aqu) [16:18:41] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification 
provided) [16:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:49] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [16:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:18] 10SRE, 10ops-codfw: Possible cable issue on restbase2010 management interface - https://phabricator.wikimedia.org/T299426 (10hnowlan) Given the flapping IPMI checks outside of the reimage issues, I suspect this might be more than a firmware upgrade, but given how some other restbase hosts have performed I'm op... [16:20:29] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [16:20:29] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1020.eqiad.wmnet with OS buster [16:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:50] (03PS2) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) [16:26:22] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sodium.wikimedia.org [16:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:29] 10SRE, 10Infrastructure-Foundations: decom sodium - https://phabricator.wikimedia.org/T298727 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jhathaway@cumin1001 for hosts: `sodium.wikimedia.org` - sodium.wikimedia.org (**PASS**) - Downtimed host on Icinga - Found physical host - Down... 
[16:46:37] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1020.eqiad.wmnet with OS buster [16:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:20] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [16:47:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1020.eqiad.wmnet [16:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:38] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1021.eqiad.wmnet with OS buster [16:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:49] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [16:55:53] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1021.eqiad.wmnet with OS buster [16:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1021.eqiad.wmnet [16:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:49] (03CR) 10JMeybohm: Add basic ingress support to chart common_templates (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [17:19:21] (03CR) 10RLazarus: [C: 03+2] Return a set, not a list, from active_images() [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748873 (owner: 10RLazarus) [17:20:09] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 
(10herron) Services have been moved to lvs_setup, but there are some pybal icinga alerts still open e.g. ` lvs1015 PyBal IPVS diff check CRITICAL 2022-01-2... [17:21:44] (03Merged) 10jenkins-bot: Return a set, not a list, from active_images() [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748873 (owner: 10RLazarus) [17:22:30] (03PS1) 10RLazarus: Release v0.0.4 [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/756026 [17:24:52] (03PS1) 10JHathaway: sodium.wikimedia.org: remove reference, decommed [puppet] - 10https://gerrit.wikimedia.org/r/756027 (https://phabricator.wikimedia.org/T298727) [17:31:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (10Andrew) This resembles the more-frequent issues that we've seen on 1003 (T297814) -- it's not exactly a crash, the system just gets so slow that things... [17:34:35] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::rabbitmq: add tls ports to firewall [puppet] - 10https://gerrit.wikimedia.org/r/755492 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [17:35:08] (03CR) 10RLazarus: [C: 03+2] Release v0.0.4 [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/756026 (owner: 10RLazarus) [17:35:15] (03CR) 10JHathaway: [C: 03+2] sodium.wikimedia.org: remove reference, decommed [puppet] - 10https://gerrit.wikimedia.org/r/756027 (https://phabricator.wikimedia.org/T298727) (owner: 10JHathaway) [17:37:09] (03PS3) 10AOkoth: kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) [17:37:11] (03PS1) 10AOkoth: gitlab: change restore manifest formatting [puppet] - 10https://gerrit.wikimedia.org/r/756033 [17:37:26] (03Merged) 10jenkins-bot: Release v0.0.4 [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/756026 (owner: 10RLazarus) [17:38:30] (03PS2) 10AOkoth: gitlab: change restore manifest 
formatting [puppet] - 10https://gerrit.wikimedia.org/r/756033 [17:40:38] (03CR) 10Dzahn: [C: 03+1] gitlab: change restore manifest formatting [puppet] - 10https://gerrit.wikimedia.org/r/756033 (owner: 10AOkoth) [17:40:57] (03PS3) 10AOkoth: gitlab: change restore manifest formatting [puppet] - 10https://gerrit.wikimedia.org/r/756033 [17:41:09] (03PS4) 10AOkoth: gitlab: change restore manifest formatting [puppet] - 10https://gerrit.wikimedia.org/r/756033 [17:42:02] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: decom sodium - https://phabricator.wikimedia.org/T298727 (10jhathaway) [17:42:19] !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/python3-imagecatalog/imagecatalog_0.0.4-1_amd64.changes [17:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:47] (03CR) 10AOkoth: [C: 03+2] gitlab: change restore manifest formatting [puppet] - 10https://gerrit.wikimedia.org/r/756033 (owner: 10AOkoth) [17:55:30] (03PS1) 10Herron: remove kibana-disc from discovery-metafo-resources [dns] - 10https://gerrit.wikimedia.org/r/756036 (https://phabricator.wikimedia.org/T299700) [17:55:56] (03CR) 10BBlack: [C: 03+1] remove kibana-disc from discovery-metafo-resources [dns] - 10https://gerrit.wikimedia.org/r/756036 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [17:57:11] (03CR) 10Herron: [C: 03+2] remove kibana-disc from discovery-metafo-resources [dns] - 10https://gerrit.wikimedia.org/r/756036 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [18:01:32] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:04:45] (03CR) 10Andrew Bogott: [C: 03+1] "This seems right but it's been years since I deployed a mw config change; hoping someone else will get it lined up." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/752134 (owner: 10Majavah) [18:09:36] (03CR) 10Andrew Bogott: [C: 03+1] LabsServices: use deployment-graphite01 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747494 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [18:11:26] (03PS1) 10Cwhite: Revert "Use the default prometheus_mysql_exporter for matomo" [puppet] - 10https://gerrit.wikimedia.org/r/755998 [18:13:55] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10RobH) Netbox Updates: * added network port to all ps1-[ef]-eqiad * added power ports (54 or 42 depending on model) to all ps[12]-[ef]-eqiad [18:15:07] (03PS1) 10Herron: remove realserver_ips from legacy elk roles & set lvs state: service_setup [puppet] - 10https://gerrit.wikimedia.org/r/756038 (https://phabricator.wikimedia.org/T299700) [18:15:26] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [18:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:03] (03CR) 10BBlack: [C: 03+1] remove realserver_ips from legacy elk roles & set lvs state: service_setup [puppet] - 10https://gerrit.wikimedia.org/r/756038 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [18:26:11] (03CR) 10Herron: [C: 03+2] remove realserver_ips from legacy elk roles & set lvs state: service_setup [puppet] - 10https://gerrit.wikimedia.org/r/756038 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [18:26:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:24] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:32:59] (03Abandoned) 10Andrew Bogott: passwords: Add ladsgroup to the cloud root [labs/private] - 10https://gerrit.wikimedia.org/r/748699 (owner: 10Ladsgroup) [18:33:14] PROBLEM 
- BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:33:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:33:20] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:36:50] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:30] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10RobH) Added in the IP addresses in the mgmt range that were available, and assigned them to the ps1-[ef]-eqiad with the following: e1: 10.65.2.45/16 e2: 10.65.2.46/16 e3: 10.65.2.47/16 e4: 10.65.... [18:38:57] (03PS1) 10CDanis: Add a start_timestamp constraint [software/statograph] - 10https://gerrit.wikimedia.org/r/756041 (https://phabricator.wikimedia.org/T298619) [18:39:42] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:24] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:18] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:45:46] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:46:42] !log restarting pybal on lvs1015,lvs1020,lvs2009,lvs2010 to remove legacy elk5 services T299700 [18:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:46] T299700: Remove legacy ELK LVS entries - 
https://phabricator.wikimedia.org/T299700 [18:49:26] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:49:38] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:51:58] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:59:02] (03CR) 10BBlack: [C: 03+1] remove elk5 related LVS services [puppet] - 10https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [19:01:08] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [19:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:43] (03PS4) 10Herron: remove elk5 related LVS services [puppet] - 10https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T299700) [19:02:17] (03PS13) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [19:02:46] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:02:51] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10Patch-For-Review: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC - https://phabricator.wikimedia.org/T298619 (10CDanis) 05Open→03Resolved {F34926035} It took just a single run of `statograph -v up... 
[19:03:51] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [19:05:23] (03CR) 10Herron: [C: 03+2] remove elk5 related LVS services [puppet] - 10https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [19:05:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:06] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron) [19:10:27] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [19:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:36] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:10:36] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-udp2log on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-udp2log is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:10:44] (03PS1) 10Herron: cleanup kibana.svc records [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) [19:11:04] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:08] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-udp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-json-udp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:22] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana-ssl on 
puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:24] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana-ssl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:26] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:28] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-gelf on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-gelf is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-tcp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-json-tcp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:34] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-udp on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-json-udp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:36] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:44] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-gelf on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-gelf is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:46] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana7 on puppetmaster2001 is CRITICAL: 
Compilation of file /srv/config-master/pybal/eqiad/kibana7 is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:50] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana7 on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana7 is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:11:52] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:12:07] (03PS2) 10Herron: cleanup logstash and kibana svc records [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) [19:12:18] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-tcp on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-json-tcp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:12:48] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-udp2log on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-udp2log is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:12:50] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:14:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:52] (03PS1) 10Herron: remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756046 (https://phabricator.wikimedia.org/T299700) [19:17:32] (03CR) 10BBlack: [C: 03+1] remove logstash and kibana entries from conftool-data discovery services [puppet] - 
10https://gerrit.wikimedia.org/r/756046 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [19:18:37] (03CR) 10Herron: [C: 03+2] remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756046 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [19:20:40] ACKNOWLEDGEMENT - PyBal BGP sessions are established on lvs6002 is CRITICAL: 0 le 0 Brandon Black These wont clear until the mx204s get configured in drmrs https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [19:20:40] ACKNOWLEDGEMENT - PyBal BGP sessions are established on lvs6003 is CRITICAL: 0 le 0 Brandon Black These wont clear until the mx204s get configured in drmrs https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [19:22:49] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana7 on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana7 is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:27:02] herron: ^ [19:27:16] cdanis: thanks, troubleshooting in -traffic [19:27:22] ah ok sorry :) [19:27:29] no worries thx for the ping [19:32:27] (03PS1) 10Herron: Revert "remove logstash and kibana entries from conftool-data discovery services" [puppet] - 10https://gerrit.wikimedia.org/r/756000 [19:33:29] (03CR) 10BBlack: cleanup logstash and kibana svc records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [19:34:21] (03CR) 10Herron: [C: 03+2] Revert "remove logstash and kibana entries from conftool-data discovery services" [puppet] - 10https://gerrit.wikimedia.org/r/756000 (owner: 10Herron) [19:38:51] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org [19:40:24] (03CR) 10Ayounsi: [C: 03+1] Add 
kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [19:43:53] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana7 on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana7 is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:51:11] PROBLEM - puppet last run on cloudbackup2002 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:57:50] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kibana7 on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:57:50] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kibana7 on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:59:31] (03CR) 10Dzahn: [C: 03+2] "yea, we don't have a mariadb/mysql backend anymore. 
and thanks for fixing the link" [puppet] - 10https://gerrit.wikimedia.org/r/755329 (owner: 10Hashar) [19:59:47] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Upgrade deployment-prep Swift cluster to Debian Buster or newer - https://phabricator.wikimedia.org/T298253 (10Majavah) [20:00:32] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Upgrade deployment-prep Swift cluster to Debian Buster or newer - https://phabricator.wikimedia.org/T298253 (10Majavah) [20:00:58] (03PS1) 10Herron: remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756001 [20:03:16] (03CR) 10BBlack: [C: 03+1] remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756001 (owner: 10Herron) [20:03:21] (03CR) 10Herron: [C: 03+2] remove logstash and kibana entries from conftool-data discovery services [puppet] - 10https://gerrit.wikimedia.org/r/756001 (owner: 10Herron) [20:05:32] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kibana7 on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:07:27] (03PS3) 10Herron: cleanup logstash and kibana svc records [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) [20:08:00] (03CR) 10Herron: cleanup logstash and kibana svc records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [20:09:16] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kibana7 on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:09:46] (03CR) 10BBlack: [C: 03+1] cleanup logstash and kibana svc records [dns] - 10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [20:09:55] (03CR) 10Herron: [C: 03+2] cleanup logstash and kibana svc records [dns] - 
10https://gerrit.wikimedia.org/r/756045 (https://phabricator.wikimedia.org/T299700) (owner: 10Herron) [20:12:54] RECOVERY - puppet last run on cloudbackup2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:17:23] (03CR) 10Cwhite: [C: 03+2] Revert "Use the default prometheus_mysql_exporter for matomo" [puppet] - 10https://gerrit.wikimedia.org/r/755998 (owner: 10Cwhite) [20:21:47] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) [20:21:51] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron) 05Open→03Resolved These have been removed with much help from @BBlack thank you! [20:25:24] (03PS4) 10Aqu: Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074) [20:26:15] (03CR) 10jerkins-bot: [V: 04-1] Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074) (owner: 10Aqu) [20:26:56] PROBLEM - Disk space on ml-etcd2002 is CRITICAL: DISK CRITICAL - free space: / 722 MB (3% inode=95%): /tmp 722 MB (3% inode=95%): /var/tmp 722 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ml-etcd2002&var-datasource=codfw+prometheus/ops [20:31:22] (03PS5) 10Aqu: Airflow: Fix links in error emails [puppet] - 10https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299074) [20:47:08] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission sodium.wikimedia.org - https://phabricator.wikimedia.org/T299785 (10wiki_willy) a:03Cmjohnson [20:49:23] (03PS3) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download 
[puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) [20:51:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (10wiki_willy) a:03Cmjohnson Assigning this to @Cmjohnson. However, I also reached out to @MoritzMuehlenhoff to take a peak at this and T297814 later ne... [20:53:01] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10wiki_willy) a:03Cmjohnson [20:55:12] (03PS4) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) [20:55:58] (03PS1) 10Cwhite: elasticsearch: write curator logs to stdout [puppet] - 10https://gerrit.wikimedia.org/r/756053 (https://phabricator.wikimedia.org/T297239) [21:00:27] (03CR) 10Cwhite: [C: 03+2] logstash: install logstash-plugins on logging logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/755811 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [21:01:54] (03CR) 10Herron: [C: 03+1] elasticsearch: write curator logs to stdout [puppet] - 10https://gerrit.wikimedia.org/r/756053 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [21:02:23] hi, sorry to ruin your mood, https://phabricator.wikimedia.org/T299767 might warrant a train rollback [21:02:49] Gah! [21:03:39] (03CR) 10Herron: [C: 03+1] logstash: switch to opensearch output plugin on production logstash [puppet] - 10https://gerrit.wikimedia.org/r/755812 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [21:03:45] i don't know what causes the issue yet [21:03:48] if a Friday push is needed I can be around for SRE [21:03:56] and whether train rollback will fix it. but hopefully [21:03:59] (for the next ~5 hours) [21:05:58] i'm here as well. 
i think j.eena may be out today. [21:09:14] okay, i know what broke it, we just need a revert [21:10:45] rzl: brennen: reverting in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/756005 , are you able to backport and deploy? [21:11:46] yeah, i'm able. also cc: twentyafterfour as backup train deployer in case around. [21:12:15] (PS1) Cathal Mooney: Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - https://gerrit.wikimedia.org/r/756057 [21:14:06] MatmaRex: cherry picking. how confident are you as far as the revert needing review? [21:14:38] brennen: i tested by monkey-patching the code in browser console, it fixes the issue for me [21:15:24] (CR) Ayounsi: [C: +1] Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - https://gerrit.wikimedia.org/r/756057 (owner: Cathal Mooney) [21:15:27] brennen: it also has a +2 now [21:15:31] (PS2) Cathal Mooney: Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - https://gerrit.wikimedia.org/r/756057 [21:15:57] (PS5) ArielGlenn: add systemd timer for Enterprise HTML dumps download and rsync [puppet] - https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) [21:16:04] (PS1) Brennen Bearnes: Revert "Re-duplicate deduplicated TemplateStyles" [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756066 (https://phabricator.wikimedia.org/T287675) [21:17:07] (CR) jerkins-bot: [V: -1] add systemd timer for Enterprise HTML dumps download and rsync [puppet] - https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) (owner: ArielGlenn) [21:17:09] (CR) Cathal Mooney: [C: +2] Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - https://gerrit.wikimedia.org/r/756057 (owner: Cathal Mooney) [21:18:11] (CR) Brennen Bearnes: [C: +2] "Tested and reviewed on master, going ahead 
with backport." [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756066 (https://phabricator.wikimedia.org/T287675) (owner: Brennen Bearnes) [21:18:35] MatmaRex: cool - thanks and going ahead. [21:18:35] (Merged) jenkins-bot: Removing entries from cr-analytics filter that refer to 'sodium' [homer/public] - https://gerrit.wikimedia.org/r/756057 (owner: Cathal Mooney) [21:19:02] thanks brennen [21:20:55] * brennen waits on CI, pulls up error logs in meanwhile. [21:21:04] (PS6) ArielGlenn: add systemd timer for Enterprise HTML dumps download and rsync [puppet] - https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) [21:21:50] !log Running homer against cr1-eqiad and cr2-eqiad to remove entries on analytics-in4/6 filters that refer to decommissioned deb mirror host sodium. [21:21:52] (CR) jerkins-bot: [V: -1] add systemd timer for Enterprise HTML dumps download and rsync [puppet] - https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) (owner: ArielGlenn) [21:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:56] (PS7) ArielGlenn: add systemd timer for Enterprise HTML dumps download and rsync [puppet] - https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) [21:34:03] (Merged) jenkins-bot: Revert "Re-duplicate deduplicated TemplateStyles" [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756066 (https://phabricator.wikimedia.org/T287675) (owner: Brennen Bearnes) [21:36:28] MatmaRex: patch on mwdebug1002 if you want to test; ready to sync. [21:36:49] brennen: thanks. 
yeah, i can [21:37:24] brennen: looks fixed [21:37:38] cool, syncing [21:38:56] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/VisualEditor/modules/ve-mw: Backport: [[gerrit:756066|Revert "Re-duplicate deduplicated TemplateStyles" (T287675 T299251 T299767)]] (duration: 00m 49s) [21:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:03] T287675: De-duplicated TemplateStyles missing when editing a section in visual section editing - https://phabricator.wikimedia.org/T287675 [21:39:03] T299767: Triggering Infobox duplication: Adds a large block of source text - https://phabricator.wikimedia.org/T299767 [21:39:03] T299251: Visual diffs sometimes missing TemplateStyles - https://phabricator.wikimedia.org/T299251 [21:39:59] thanks brennen [21:40:30] hope the rest of your weekend is better than this :D [21:40:42] ^ agreed on both counts! [21:40:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:32] MatmaRex: same to you. :) [21:42:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:42:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:11] (PS1) Cathal Mooney: Modify labs-in[4|6] filters in eqiad to allow traffic to codfw backups [homer/public] - https://gerrit.wikimedia.org/r/756060 [21:50:50] (CR) Andrew Bogott: [C: +1] "thank you!" 
[homer/public] - https://gerrit.wikimedia.org/r/756060 (owner: Cathal Mooney) [21:54:46] (PS2) Cathal Mooney: Modify labs-in[4|6] filters in eqiad to allow traffic to codfw backups [homer/public] - https://gerrit.wikimedia.org/r/756060 [21:55:35] (CR) Cathal Mooney: [C: +2] Modify labs-in[4|6] filters in eqiad to allow traffic to codfw backups [homer/public] - https://gerrit.wikimedia.org/r/756060 (owner: Cathal Mooney) [21:55:55] (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash - https://alerts.wikimedia.org [21:56:13] (Merged) jenkins-bot: Modify labs-in[4|6] filters in eqiad to allow traffic to codfw backups [homer/public] - https://gerrit.wikimedia.org/r/756060 (owner: Cathal Mooney) [21:59:43] * cwhite looking into logstash [22:06:13] (PS1) Accraze: ml-services: add draftquality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/756064 (https://phabricator.wikimedia.org/T298989) [22:06:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_logstash site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:08:46] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: logstash.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:10:10] (PS1) Cwhite: Revert "logstash: install logstash-plugins on logging logstash clusters" [puppet] - https://gerrit.wikimedia.org/r/756067 [22:10:32] (CR) Cwhite: [V: +2 C: +2] Revert "logstash: install logstash-plugins on logging logstash clusters" [puppet] - https://gerrit.wikimedia.org/r/756067 (owner: Cwhite) [22:13:32] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:55] (LogstashNoLogsIndexed) resolved: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash - https://alerts.wikimedia.org [22:21:40] (PS8) ArielGlenn: add systemd timer for Enterprise HTML dumps download and rsync [puppet] - https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) [22:21:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [22:23:55] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mx1001.wikimedia.org with reason: kernel testing [22:23:56] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mx1001.wikimedia.org with reason: kernel testing [22:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [22:29:05] (PS1) Cathal Mooney: Add TCP port 6812 to ports allowed from 
cloudbackup to cloudceph eqiad [homer/public] - https://gerrit.wikimedia.org/r/756089 [22:30:44] (CR) Majavah: [C: +1] Add TCP port 6812 to ports allowed from cloudbackup to cloudceph eqiad [homer/public] - https://gerrit.wikimedia.org/r/756089 (owner: Cathal Mooney) [22:31:11] (CR) Andrew Bogott: [C: +1] Add TCP port 6812 to ports allowed from cloudbackup to cloudceph eqiad [homer/public] - https://gerrit.wikimedia.org/r/756089 (owner: Cathal Mooney) [22:31:18] (CR) Cathal Mooney: [C: +2] Add TCP port 6812 to ports allowed from cloudbackup to cloudceph eqiad [homer/public] - https://gerrit.wikimedia.org/r/756089 (owner: Cathal Mooney) [22:31:54] (Merged) jenkins-bot: Add TCP port 6812 to ports allowed from cloudbackup to cloudceph eqiad [homer/public] - https://gerrit.wikimedia.org/r/756089 (owner: Cathal Mooney) [22:31:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [22:39:18] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:43:08] (PS1) Cwhite: Revert "Revert "logstash: install logstash-plugins on logging logstash clusters"" [puppet] - https://gerrit.wikimedia.org/r/756068 [22:45:40] (CR) Cwhite: [C: +2] Revert "Revert "logstash: install logstash-plugins on logging logstash clusters"" [puppet] - https://gerrit.wikimedia.org/r/756068 (owner: Cwhite)