[00:01:15] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[00:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:08] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[00:03:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:47] <icinga-wm>	 PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:09] <icinga-wm>	 PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:11] <icinga-wm>	 PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:35] <icinga-wm>	 RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:40] <wikibugs>	 (CR) Cwhite: [C: +2] hiera: add pki to logging env [puppet] - https://gerrit.wikimedia.org/r/769711 (https://phabricator.wikimedia.org/T300130) (owner: Cwhite)
[00:22:41] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:27:57] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:28:07] <wikibugs>	 (PS1) Cwhite: Revert "hiera: add pki to logging env" [puppet] - https://gerrit.wikimedia.org/r/769563
[00:31:04] <wikibugs>	 (CR) Cwhite: [C: +2] Revert "hiera: add pki to logging env" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/769563 (owner: Cwhite)
[00:33:17] <TimStarling>	 !log on mwmaint1002 running populateGlobalEditCount.php
[00:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:05] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:41:12] <wikibugs>	 SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (colewhite) [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/769563 | Reverted ]] due to puppet failures:   # I think the cloud puppetmaster doesn't have a cert at `...
[00:42:51] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:15:19] <wikibugs>	 (PS1) RLazarus: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231)
[01:25:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:14] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:04:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org
[02:09:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org
[02:12:19] <icinga-wm>	 PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:26:07] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:34:09] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:34:11] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:36:53] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:07:23] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:29:22] <wikibugs>	 (PS1) Legoktm: Revert "Cache Badtitle 400s for 60s in varnish-fe" [puppet] - https://gerrit.wikimedia.org/r/769827
[03:41:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:44:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:01:03] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:02:53] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.153.111 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[04:05:41] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.153.111 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[05:29:25] <icinga-wm>	 RECOVERY - Check systemd state on datahubsearch1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:45] <icinga-wm>	 PROBLEM - Check systemd state on datahubsearch1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:45:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P22360 and previous config saved to /var/cache/conftool/dbconfig/20220311-054514-marostegui.json
[05:45:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22361 and previous config saved to /var/cache/conftool/dbconfig/20220311-055409-root.json
[05:54:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22362 and previous config saved to /var/cache/conftool/dbconfig/20220311-060913-root.json
[06:09:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:14] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Switchover m2-master [dns] - https://gerrit.wikimedia.org/r/769708 (owner: Marostegui)
[06:13:40] <marostegui>	 !log Reboot dbproxy1014 T303174
[06:13:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:39] <wikibugs>	 (PS1) Marostegui: Revert "db1099: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769828
[06:16:07] <wikibugs>	 (PS1) Marostegui: Revert "wmnet: Failover m1-master" [dns] - https://gerrit.wikimedia.org/r/769829
[06:16:16] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1099: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769828 (owner: Marostegui)
[06:16:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org
[06:21:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org
[06:24:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22363 and previous config saved to /var/cache/conftool/dbconfig/20220311-062417-root.json
[06:24:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:25] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:39:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22364 and previous config saved to /var/cache/conftool/dbconfig/20220311-063921-root.json
[06:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:05] <_joe_>	 uh not sure if it's connected to the cr2-esams issue
[07:09:13] <_joe_>	 but I can't reach gerrit via ipv6
[07:09:35] <_joe_>	 marostegui: can you reach gerrit rn?
[07:14:30] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:16:17] <marostegui>	 works for me _joe_ 
[07:20:57] <_joe_>	 marostegui: yeah for me too after the recovery
[07:34:08] <wikibugs>	 (PS10) Giuseppe Lavagetto: C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[07:34:10] <wikibugs>	 (PS6) Giuseppe Lavagetto: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond)
[07:34:12] <wikibugs>	 (PS1) Giuseppe Lavagetto: P:cache::base: add netmapper file for abuse networks [puppet] - https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471)
[07:34:14] <wikibugs>	 (PS1) Giuseppe Lavagetto: C:varnish: load abuse_networks.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769900 (https://phabricator.wikimedia.org/T302471)
[07:34:16] <wikibugs>	 (PS1) Giuseppe Lavagetto: C:varnish: introduce the X-Abuse-Network request "header" [puppet] - https://gerrit.wikimedia.org/r/769901 (https://phabricator.wikimedia.org/T302471)
[07:34:18] <wikibugs>	 (PS1) Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - https://gerrit.wikimedia.org/r/769902
[07:53:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for uwsgi-netbox-scriptproxy [puppet] - https://gerrit.wikimedia.org/r/767834 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:55:52] <wikibugs>	 (PS19) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[07:58:33] <wikibugs>	 (PS2) Jcrespo: Add Cumin alias for mediabackups [puppet] - https://gerrit.wikimedia.org/r/769731 (owner: Muehlenhoff)
[07:58:42] <wikibugs>	 (PS3) Jcrespo: Add Cumin alias for mediabackups [puppet] - https://gerrit.wikimedia.org/r/769731 (owner: Muehlenhoff)
[07:59:30] <wikibugs>	 (PS4) Jcrespo: Add Cumin alias for mediabackups worker hosts [puppet] - https://gerrit.wikimedia.org/r/769731 (owner: Muehlenhoff)
[07:59:52] <wikibugs>	 (PS5) Jcrespo: Add Cumin alias for mediabackup worker hosts [puppet] - https://gerrit.wikimedia.org/r/769731 (owner: Muehlenhoff)
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220311T0800)
[08:01:52] <wikibugs>	 (CR) Jcrespo: [C: +2] Add Cumin alias for mediabackup worker hosts [puppet] - https://gerrit.wikimedia.org/r/769731 (owner: Muehlenhoff)
[08:17:02] <wikibugs>	 (PS1) Muehlenhoff: Add profile::java to role::builder to install JDK 8/11 [puppet] - https://gerrit.wikimedia.org/r/769908
[08:18:57] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769908 (owner: Muehlenhoff)
[08:19:08] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye
[08:19:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:13] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host cloudvirt1017.eqiad.wmnet
[08:21:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:42] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add profile::java to role::builder to install JDK 8/11 [puppet] - https://gerrit.wikimedia.org/r/769908 (owner: Muehlenhoff)
[08:23:31] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudvirt1017.eqiad.wmnet
[08:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:36] <wikibugs>	 (PS1) Muehlenhoff: Also enable component/jdk8 for bullseye, also present there [puppet] - https://gerrit.wikimedia.org/r/769909
[08:30:40] <jynus>	 !log upgrade and restart db1145
[08:30:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:24] <wikibugs>	 (PS2) Muehlenhoff: Also enable component/jdk8 for bullseye, also present there [puppet] - https://gerrit.wikimedia.org/r/769909
[08:38:04] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769909 (owner: Muehlenhoff)
[08:40:37] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ayounsi) I spent some time on cloudvirt1017 yesterday, I was able to confirm that: * When on the live host, with tcpdump, `sudo dh...
[08:41:44] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host cloudvirt1017.eqiad.wmnet
[08:41:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:00] <jynus>	 !log upgrade and restart db2139
[08:42:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:21] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Also enable component/jdk8 for bullseye, also present there [puppet] - https://gerrit.wikimedia.org/r/769909 (owner: Muehlenhoff)
[08:43:47] <dcausse>	 !log restarting blazegraph on wdqs1012 (jvm stuck for 5hours)
[08:43:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:02] <wikibugs>	 (PS1) Muehlenhoff: Revert "Also enable component/jdk8 for bullseye, also present there" [puppet] - https://gerrit.wikimedia.org/r/769910
[08:46:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Revert "Also enable component/jdk8 for bullseye, also present there" [puppet] - https://gerrit.wikimedia.org/r/769910 (owner: Muehlenhoff)
[08:50:40] <icinga-wm>	 PROBLEM - MD RAID on ganeti2013 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 2, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:50:41] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ganeti2013 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T303585 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:50:44] <wikibugs>	 SRE, ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T303585 (ops-monitoring-bot)
[08:51:34] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab_runner: cleanup service unit file [puppet] - https://gerrit.wikimedia.org/r/769737 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[08:51:43] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudvirt1017.eqiad.wmnet
[08:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:19] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye
[08:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:26] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org
[08:55:59] <jayme>	 hmm...
[08:57:44] <wikibugs>	 (CR) Vgutierrez: [C: +1] C:varnish: Load public-clouds.json via netmapper (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[09:00:05] <wikibugs>	 (CR) Vgutierrez: [C: +1] C:varnish: use X-Public-Cloud to store the cloud provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond)
[09:00:26] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org
[09:00:51] <jayme>	 !log kubernetes2011:~# systemctl restart rsyslog.service - T289766
[09:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:55] <stashbot>	 T289766: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766
[09:01:22] <elukey>	 jayme: I am about to send a code change to reimage the node :D
[09:15:16] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye
[09:15:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye
[09:15:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:36] <wikibugs>	 Puppet, Infrastructure-Foundations, observability, cloud-services-team (Kanban): 2 systemctl services failing on cloudcontrol hosts: prometheus-openstack-exporter and logrotate - https://phabricator.wikimedia.org/T303511 (aborrero)
[09:27:42] <wikibugs>	 SRE-Access-Requests, Release-Engineering-Team, serviceops: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (Joe) p:Triage→High @thcipriani I guess we need your approval for this.
[09:29:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[09:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet
[09:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:46] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye
[09:35:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:01] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye
[09:36:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:42:06] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (dcaro) I'm not very knowledgeable in this subject, so I'm probably not making sense but, some things that are not clear to me xd *...
[09:42:06] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on ganeti2013 is CRITICAL: cluster=ganeti device=sdb instance=ganeti2013 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ganeti2013&var-datasource=codfw+prometheus/ops
[09:42:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[09:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet
[09:47:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:29] <vgutierrez>	 !log stopping certspotter on alert1001
[09:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:51] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ayounsi) Some progress: `name=install1003 DHCP discover,lines=20 09:19:22.791939 IP (tos 0x0, ttl 64, id 14579, offset 0, flags [n...
[09:52:52] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:25] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1017.eqiad.wmnet with OS bullseye
[09:54:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:52] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs for kubernetes2011 [puppet] - https://gerrit.wikimedia.org/r/769919 (https://phabricator.wikimedia.org/T300744)
[09:57:54] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs for kubernetes2012 [puppet] - https://gerrit.wikimedia.org/r/769920 (https://phabricator.wikimedia.org/T300744)
[09:57:58] <elukey>	 jayme: --^ all yours :)
[10:00:45] <wikibugs>	 (CR) JMeybohm: [C: +1] Set bullseye + overlayfs for kubernetes2011 [puppet] - https://gerrit.wikimedia.org/r/769919 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[10:01:18] <wikibugs>	 (CR) JMeybohm: [C: +1] Set bullseye + overlayfs for kubernetes2012 [puppet] - https://gerrit.wikimedia.org/r/769920 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[10:01:25] <jayme>	 I really like those!
[10:02:19] <elukey>	 \o/ thanks!
[10:02:25] <elukey>	 going to prep for the 2011's reimage
[10:02:28] <wikibugs>	 SRE, Traffic-Icebox: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (Vgutierrez)
[10:03:50] <dcausse>	 !log manually installed jvmquake to wdqs1010 (test machine) from https://people.wikimedia.org/~jmm/jvmquake/
[10:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:03] <wikibugs>	 (PS1) Jbond: C:varnish: drop carries netmapper config [puppet] - https://gerrit.wikimedia.org/r/769927
[10:04:25] <wikibugs>	 (PS1) Vgutierrez: certspotter: Temporarily disable certspotter [puppet] - https://gerrit.wikimedia.org/r/769928 (https://phabricator.wikimedia.org/T303593)
[10:04:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[10:04:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[10:04:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
[10:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
[10:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:08] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes2011 [puppet] - https://gerrit.wikimedia.org/r/769919 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[10:06:30] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34204/console" [puppet] - https://gerrit.wikimedia.org/r/769928 (https://phabricator.wikimedia.org/T303593) (owner: Vgutierrez)
[10:07:44] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] certspotter: Temporarily disable certspotter [puppet] - https://gerrit.wikimedia.org/r/769928 (https://phabricator.wikimedia.org/T303593) (owner: Vgutierrez)
[10:08:54] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] P:tcpircbot: cleanup allowed hosts [puppet] - https://gerrit.wikimedia.org/r/768662 (owner: Majavah)
[10:09:01] <wikibugs>	 (CR) Alexandros Kosiaris: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/768662 (owner: Majavah)
[10:09:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[10:09:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2011.codfw.wmnet with OS bullseye
[10:09:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Make k8s-ingress-wikikube page [puppet] - https://gerrit.wikimedia.org/r/767078 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[10:13:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "Please also fix the docker images if they still need to 😊" [puppet] - https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) (owner: JMeybohm)
[10:14:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet
[10:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:56] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:14:58] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:15:11] <elukey>	 this is me reimaging kubernetes2011 --^
[10:16:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:16:29] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons.
[10:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2110.codfw.wmnet with OS bullseye
[10:16:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:55] <wikibugs>	 (CR) Jbond: C:varnish: Load public-clouds.json via netmapper (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[10:19:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[10:19:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:13] <wikibugs>	 (PS2) Phuedx: Remove unused wgWMESearchRelevancePages config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/769748
[10:21:10] <wikibugs>	 (PS7) Jbond: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769511
[10:21:30] <wikibugs>	 (PS9) Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391)
[10:22:47] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[10:23:15] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769900 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[10:24:23] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769901 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[10:24:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet
[10:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:49] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2011.codfw.wmnet with reason: host reimage
[10:24:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:34] <vgutierrez>	 !log disable certspotter - T303593
[10:25:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:37] <stashbot>	 T303593: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593
[10:26:41] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:26:48] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (Vgutierrez) p:Triage→Medium
[10:28:11] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2011.codfw.wmnet with reason: host reimage
[10:28:11] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2110.codfw.wmnet with reason: host reimage
[10:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:11] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:33:46] <wikibugs>	 SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (elukey) @colewhite I had a chat with John and the only current supported way is to have a self-hosted puppet master in the cloud project, so I am wondering if this is some...
[10:34:11] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:34:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2110.codfw.wmnet with reason: host reimage
[10:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:49] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons.
[10:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:00] <wikibugs>	 (CR) Jbond: "there is also phabricator_abusers which is used in misc-frontend[1]" [puppet] - https://gerrit.wikimedia.org/r/769902 (owner: Giuseppe Lavagetto)
[10:38:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ayounsi) @Jclark-ctr let's hold on putting public hosts in the new rows for now. So ideally those would go to A-D.
[10:39:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons.
[10:39:03] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:11] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:40:55] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2011.codfw.wmnet with OS bullseye
[10:40:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:05] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes2012 [puppet] - https://gerrit.wikimedia.org/r/769920 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[10:46:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2012.codfw.wmnet with OS bullseye
[10:46:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2110.codfw.wmnet with OS bullseye
[10:49:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:18] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:58:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:58:41] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:59:27] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons.
[10:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2012.codfw.wmnet with reason: host reimage
[11:02:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:03:41] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:05:11] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:05:31] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2012.codfw.wmnet with reason: host reimage
[11:05:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:06] <wikibugs>	 SRE, ops-eqiad, Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (elukey)
[11:08:11] <wikibugs>	 (CR) Jbond: [C: -1] "See inline for nits, -1 is just for the leftover merge artefacts" [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:09:16] <wikibugs>	 SRE, ops-eqiad, Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (BTullis) Yes, I'm happy to shut down these nodes whenever @Cmjohnson prefers.
[11:10:11] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:10:30] <wikibugs>	 (PS25) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[11:10:48] <wikibugs>	 (CR) Jbond: [C: -1] varnish/frontend: consume etcd data for dynamic banning of requests. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:11:11] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:11:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[11:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:24] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:13:30] <wikibugs>	 (CR) Jbond: [C: +1] cache: turn on dynamic bans on all of eqsin [puppet] - https://gerrit.wikimedia.org/r/769389 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:13:37] <wikibugs>	 (CR) Jbond: [C: +1] cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:13:40] <wikibugs>	 (PS1) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[11:13:45] <wikibugs>	 (PS1) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[11:13:51] <wikibugs>	 (PS1) MVernon: swift::ring: deploy by tarball not individual files [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117)
[11:14:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:14:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] swift::ring: deploy by tarball not individual files [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:15:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:16:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet
[11:16:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:40] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 133, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:18:15] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2012.codfw.wmnet with OS bullseye
[11:18:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:19:09] <wikibugs>	 (PS2) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[11:19:20] <wikibugs>	 (CR) Btullis: Add helm charts and a helmfile configuration for datahub (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[11:21:33] <wikibugs>	 (PS2) MVernon: swift::ring: deploy by tarball not individual files [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117)
[11:26:50] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) So there's multiple issues here:  According to the Dell support matrix : "H750/HBA350i/HBA355e require 20.04.2 minimum" https://linux.dell.com/files/supportmatrix/Ubuntu_LTS_Support_Ma...
[11:26:52] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH
[11:28:46] <wikibugs>	 SRE, Data-Catalog, Data-Engineering, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) I've updated the helm charts for datahub so that the secrets handling is compatible with our puppet based secret handling method.  T...
[11:32:47] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769780 (owner: JHathaway)
[11:33:08] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769782 (owner: JHathaway)
[11:33:47] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:36:53] <wikibugs>	 (PS2) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[11:38:40] <wikibugs>	 (PS3) Muehlenhoff: Remove cumin2001 from Puppet [puppet] - https://gerrit.wikimedia.org/r/769712 (https://phabricator.wikimedia.org/T303399)
[11:40:03] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:40:26] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:41:00] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:42:19] <wikibugs>	 SRE, Data-Catalog, Data-Engineering, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis)
[11:44:19] <wikibugs>	 SRE, Data-Catalog, Data-Engineering, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) I've updated the diagram to clarify the way that traffic is intended to flow within the deployment - i.e. requests to the GMS do not...
[11:47:38] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (MoritzMuehlenhoff)
[11:48:17] <wikibugs>	 SRE, Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (MoritzMuehlenhoff) Open→Resolved cumin2002 is the active Cumin host in codfw, decommission of cumin2001 happens via https://phabricator.wikimedia.org/T303399
[11:51:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cumin2001.codfw.wmnet
[11:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:35] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/769780 (owner: JHathaway)
[11:55:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:56:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:55] <wikibugs>	 (PS3) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[11:58:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:59:52] <wikibugs>	 (PS4) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[12:00:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:00:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:38] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[12:03:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cumin2001.codfw.wmnet
[12:03:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:16] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add several ASNs to those that alert as critical from Icinga [puppet] - https://gerrit.wikimedia.org/r/767862 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[12:11:10] <wikibugs>	 (PS1) Cathal Mooney: Add new QFX switches in Eqiad row E/F to rancid for config backup [puppet] - https://gerrit.wikimedia.org/r/769950 (https://phabricator.wikimedia.org/T299758)
[12:15:05] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (cmooney) @MatthewVernon apologies for the late reply, I've been only working part-time the last few days as I'd been ill.  I think it is fine to proceed, but...
[12:22:27] <wikibugs>	 (CR) Ayounsi: [C: +1] Add new QFX switches in Eqiad row E/F to rancid for config backup [puppet] - https://gerrit.wikimedia.org/r/769950 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[12:25:30] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis) I'll carry out this work.  I can also confirm that Njideka is WMF staff on the Data Engineering team and that she requires these pri...
[12:27:22] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add new QFX switches in Eqiad row E/F to rancid for config backup [puppet] - https://gerrit.wikimedia.org/r/769950 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[12:27:25] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis)
[12:31:24] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis) p:Triage→Medium
[12:41:13] <wikibugs>	 (PS3) MVernon: swift::ring: deploy by tarball not individual files [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117)
[12:42:00] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[12:42:33] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis)
[12:47:03] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis) Technically, the procedure states that I'm also supposed to wait for @Ottomata to approve, although @Milimetric has also been approv...
[12:48:27] <wikibugs>	 (PS1) Jbond: puppet_compiler: add support for the netbox-hiera repo [software/puppet-compiler] - https://gerrit.wikimedia.org/r/769953
[12:48:29] <wikibugs>	 (PS6) Jcrespo: Check that xtrabackup --prepare is using the same version [software/wmfbackups] - https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959)
[12:56:56] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (Marostegui) @MoritzMuehlenhoff @Volans I guess ^ means we also need to replace our raid monitoring tools?
[13:03:39] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) >>! In T297913#7769796, @Marostegui wrote: > @MoritzMuehlenhoff @Volans I guess ^ means we also need to replace our raid monitoring tools?  Yes, our monitoring calls the megacli binary...
[13:04:06] <wikibugs>	 (CR) Jcrespo: Check that xtrabackup --prepare is using the same version (1 comment) [software/wmfbackups] - https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959) (owner: Jcrespo)
[13:15:35] <wikibugs>	 (PS2) Marostegui: Revert "wmnet: Failover m1-master" [dns] - https://gerrit.wikimedia.org/r/769829
[13:16:55] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "wmnet: Failover m1-master" [dns] - https://gerrit.wikimedia.org/r/769829 (owner: Marostegui)
[13:19:16] <wikibugs>	 (PS1) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[13:19:55] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (ayounsi)
[13:21:42] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[13:24:34] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[13:25:42] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis) `cross-validate-accounts` exits without error. ` btullis@mwmaint1002:~$ cross-validate-accounts --username nokafor --uid 38462 --ema...
[13:28:22] <wikibugs>	 (PS1) Btullis: Enable production shell access for Njideka Okafor [puppet] - https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516)
[13:29:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable production shell access for Njideka Okafor [puppet] - https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: Btullis)
[13:30:47] <wikibugs>	 (PS2) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[13:30:49] <wikibugs>	 (PS2) Btullis: Enable production shell access for Njideka Okafor [puppet] - https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516)
[13:33:30] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (ayounsi) @Dale_Zhou  I can't find any Wikitech user with the name "Dale_Zhou" see the instructions on https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer...
[13:33:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (ayounsi) a:Dale_Zhou
[13:34:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123', diff saved to https://phabricator.wikimedia.org/P22366 and previous config saved to /var/cache/conftool/dbconfig/20220311-133407-marostegui.json
[13:34:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:30] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (ayounsi)
[13:36:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 1%: After reboot', diff saved to https://phabricator.wikimedia.org/P22367 and previous config saved to /var/cache/conftool/dbconfig/20220311-133633-root.json
[13:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[13:43:41] <jelto>	 !log update pcc facts
[13:43:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:38] <wikibugs>	 (PS1) Jbond: examples: add some example spec files [puppet] - https://gerrit.wikimedia.org/r/769972
[13:46:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] examples: add some example spec files [puppet] - https://gerrit.wikimedia.org/r/769972 (owner: Jbond)
[13:48:06] <wikibugs>	 (PS1) Ayounsi: Add shubhankar to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/769973 (https://phabricator.wikimedia.org/T303032)
[13:49:25] <marostegui>	 !log dbmaint on s1@eqiad T298294
[13:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:29] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[13:49:39] <marostegui>	 !log dbmaint on s8@eqiad T300775
[13:49:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:42] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[13:51:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 5%: After reboot', diff saved to https://phabricator.wikimedia.org/P22368 and previous config saved to /var/cache/conftool/dbconfig/20220311-135137-root.json
[13:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:54] <wikibugs>	 (PS1) Giuseppe Lavagetto: varnish: allow injecting rate-limiting rules from hiera [puppet] - https://gerrit.wikimedia.org/r/769975
[13:56:20] <wikibugs>	 (PS2) Giuseppe Lavagetto: varnish: allow injecting rate-limiting rules from hiera [puppet] - https://gerrit.wikimedia.org/r/769975
[13:56:23] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/769973 (https://phabricator.wikimedia.org/T303032) (owner: Ayounsi)
[13:56:43] <wikibugs>	 (CR) Ayounsi: [C: +2] Add shubhankar to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/769973 (https://phabricator.wikimedia.org/T303032) (owner: Ayounsi)
[13:57:27] <wikibugs>	 (CR) Ayounsi: [C: +2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/769973 (https://phabricator.wikimedia.org/T303032) (owner: Ayounsi)
[14:03:29] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add stub data for fe_ratelimit injection [labs/private] - https://gerrit.wikimedia.org/r/769977 (https://phabricator.wikimedia.org/T303534)
[14:03:38] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (ayounsi) Open→Resolved a:ayounsi @ShubhankarP you should now have access,  The doc on https://wikitech.wikimedia.org/wiki/SRE/Productio...
[14:04:29] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add stub data for fe_ratelimit injection [labs/private] - https://gerrit.wikimedia.org/r/769977 (https://phabricator.wikimedia.org/T303534) (owner: Giuseppe Lavagetto)
[14:04:59] <wikibugs>	 (PS1) Marostegui: Revert "db1143: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769836
[14:05:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1170:3317', diff saved to https://phabricator.wikimedia.org/P22369 and previous config saved to /var/cache/conftool/dbconfig/20220311-140549-marostegui.json
[14:05:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:02] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1143: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769836 (owner: Marostegui)
[14:06:37] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs settings for kubernetes2013 [puppet] - https://gerrit.wikimedia.org/r/769978 (https://phabricator.wikimedia.org/T300744)
[14:06:39] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs settings for kubernetes2014 [puppet] - https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744)
[14:06:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22370 and previous config saved to /var/cache/conftool/dbconfig/20220311-140641-root.json
[14:06:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:37] <wikibugs>	 (PS1) Marostegui: Revert "db1144: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769837
[14:07:51] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34211/console" [puppet] - https://gerrit.wikimedia.org/r/769975 (owner: Giuseppe Lavagetto)
[14:08:24] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:25] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1144: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769837 (owner: Marostegui)
[14:09:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34212/console" [puppet] - https://gerrit.wikimedia.org/r/769975 (owner: Giuseppe Lavagetto)
[14:09:23] <wikibugs>	 (PS1) Marostegui: Revert "db1142: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769838
[14:09:44] <wikibugs>	 (PS1) Marostegui: Revert "db1141: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769839
[14:10:04] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1142: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769838 (owner: Marostegui)
[14:10:29] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1141: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769839 (owner: Marostegui)
[14:11:30] <wikibugs>	 (PS1) Marostegui: Revert "db2126,db2095: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769840
[14:12:31] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2126,db2095: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769840 (owner: Marostegui)
[14:13:05] <wikibugs>	 (CR) Jbond: [C: +2] motd::message: add new define for simple motd entries [puppet] - https://gerrit.wikimedia.org/r/765265 (owner: Jbond)
[14:14:00] <wikibugs>	 (PS3) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[14:21:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P22371 and previous config saved to /var/cache/conftool/dbconfig/20220311-142147-root.json
[14:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:53] <wikibugs>	 (03PS1) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[14:23:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond)
[14:25:01] <wikibugs>	 (03PS4) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[14:27:40] <wikibugs>	 (03PS2) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[14:27:45] <wikibugs>	 (03PS5) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[14:28:04] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: varnish: allow injecting rate-limiting rules from hiera [puppet] - 10https://gerrit.wikimedia.org/r/769975 (https://phabricator.wikimedia.org/T303534)
[14:29:00] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Set bullseye + overlayfs settings for kubernetes2013 [puppet] - 10https://gerrit.wikimedia.org/r/769978 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[14:29:14] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Set bullseye + overlayfs settings for kubernetes2014 [puppet] - 10https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[14:30:44] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs settings for kubernetes2013 [puppet] - 10https://gerrit.wikimedia.org/r/769978 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[14:35:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2013.codfw.wmnet with OS bullseye
[14:35:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P22372 and previous config saved to /var/cache/conftool/dbconfig/20220311-143652-root.json
[14:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:12] <wikibugs>	 (03PS6) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[14:40:59] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:42:01] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:42:21] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34218/console" [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[14:43:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2013.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[14:45:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban, 10Patch-For-Review: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10Milimetric) I think Andrew just delegated to me for when he's out of town, but I approve!
[14:48:09] <wikibugs>	 (03PS1) 10Jgiannelos: Revert "mobileapps: Bump to 2022-03-10-175759-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769841
[14:49:46] <wikibugs>	 (03CR) 10Vgutierrez: "overall LGTM (just fix the syntax error). It would be great if tests could be provided as part of modules/varnish/files/tests/text/09-anal" [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx)
[14:51:07] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2013.codfw.wmnet with reason: host reimage
[14:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:19] <nemo-yiannis>	 Hi! Yesterday's deployment on mobileapps broke android on production. Can we deploy out of schedule today?
[14:51:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P22373 and previous config saved to /var/cache/conftool/dbconfig/20220311-145159-root.json
[14:52:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:53] <nemo-yiannis>	 cc thcipriani here is the revert patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/769841
[14:54:28] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2013.codfw.wmnet with reason: host reimage
[14:54:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:28] <XioNoX>	 !log cr2-esams AVOID-PATHS as-path TI "6762 .*"
[14:57:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:35] <XioNoX>	 !log cr2-esams AVOID-PATHS as-path TI "6762 .*" <- rolled back
[15:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:43] <XioNoX>	 !log cr1/2-eqiad AVOID-PATHS as-path TI "6762 .*"
[15:02:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:33] <wikibugs>	 (03PS3) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[15:05:35] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:05:41] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:06:50] <wikibugs>	 (03PS4) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[15:07:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P22374 and previous config saved to /var/cache/conftool/dbconfig/20220311-150702-root.json
[15:07:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:22] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host kubernetes2013.codfw.wmnet with OS bullseye
[15:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2013.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:12:14] <elukey>	 uff failed?
[15:12:53] <elukey>	 ah Failed to get Netbox script results, try manually
[15:13:11] <icinga-wm>	 PROBLEM - SSH on analytics1067.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:13:34] <elukey>	 and https://netbox.wikimedia.org/api/extras/job-results/2654017/ doesn't look good
[15:13:57] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:47] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - 10https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648)
[15:15:14] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10Dale_Zhou)
[15:16:22] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "😅" [puppet] - 10https://gerrit.wikimedia.org/r/769975 (https://phabricator.wikimedia.org/T303534) (owner: 10Giuseppe Lavagetto)
[15:17:27] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: allow injecting rate-limiting rules from hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769975 (https://phabricator.wikimedia.org/T303534) (owner: 10Giuseppe Lavagetto)
[15:18:58] <wikibugs>	 (03PS5) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[15:19:08] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10Dale_Zhou) >>! In T303031#7769858, @ayounsi wrote: > @Dale_Zhou  I can't find any Wikitech user with the name "Dale_Zhou" see the instructions on https://wikitech.wikimedi...
[15:20:34] <elukey>	 nemo-yiannis: o/ if the issue is user facing and impacts users right now, I think that a deploy makes sense.
[15:20:50] <elukey>	 do you have all sign-offs for the revert?
[15:21:46] <elukey>	 ah it is a revert in the docker image, yeah I think it is fine in my opinion
[15:21:56] <nemo-yiannis>	 Yeah it's introducing some UI changes that don't work on android. I can merge the patch.
[15:22:05] <elukey>	 maybe it would be good to get a +1 from somebody else
[15:22:25] <nemo-yiannis>	 sure
[15:23:49] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs settings for kubernetes2014 [puppet] - 10https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[15:23:51] <wikibugs>	 (03PS6) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[15:23:53] <wikibugs>	 (03PS1) 10Jbond: motd::message: use the title as the message by default [puppet] - 10https://gerrit.wikimedia.org/r/769992
[15:23:55] <wikibugs>	 (03PS2) 10Elukey: Set bullseye + overlayfs settings for kubernetes2014 [puppet] - 10https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744)
[15:23:57] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Set bullseye + overlayfs settings for kubernetes2014 [puppet] - 10https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[15:24:00] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:24:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] motd::message: use the title as the message by default [puppet] - 10https://gerrit.wikimedia.org/r/769992 (owner: 10Jbond)
[15:24:55] <wikibugs>	 (03PS1) 10Btullis: Allow access to MariaDB analytics-meta from Kubernetes pods [puppet] - 10https://gerrit.wikimedia.org/r/769993 (https://phabricator.wikimedia.org/T303049)
[15:26:00] <wikibugs>	 (03PS2) 10Jbond: motd::message: use the title as the message by default [puppet] - 10https://gerrit.wikimedia.org/r/769992
[15:26:07] <wikibugs>	 (03PS7) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[15:27:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2014.codfw.wmnet with OS bullseye
[15:27:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:38] <wikibugs>	 (03CR) 10Isabelle Hurbain-Palatin: [C: 03+1] "I checked that this is indeed a revert of the indicated commit, and that the indicated commit is the previous one in the chain, which woul" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769841 (owner: 10Jgiannelos)
[15:27:47] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Allow access to MariaDB analytics-meta from Kubernetes pods [puppet] - 10https://gerrit.wikimedia.org/r/769993 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis)
[15:30:40] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] Revert "mobileapps: Bump to 2022-03-10-175759-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769841 (owner: 10Jgiannelos)
[15:31:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] motd::message: use the title as the message by default [puppet] - 10https://gerrit.wikimedia.org/r/769992 (owner: 10Jbond)
[15:32:06] <nemo-yiannis>	 elukey: thanks, deploying now
[15:32:18] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Allow access to MariaDB analytics-meta from Kubernetes pods [puppet] - 10https://gerrit.wikimedia.org/r/769993 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis)
[15:33:02] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:33:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:05] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2014.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:34:56] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mobileapps: Bump to 2022-03-10-175759-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769841 (owner: 10Jgiannelos)
[15:35:04] <wikibugs>	 (03PS1) 10Ayounsi: Downpref TI in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/769994
[15:35:51] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:17] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:36:59] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:45] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:37:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:41] <wikibugs>	 (03PS2) 10Jbond: puppet_compiler: add support for the netbox-hiera repo [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/769953
[15:38:49] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:38:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:43] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:41:32] <wikibugs>	 (03PS1) 10Ayounsi: Add dalezhou to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/769996 (https://phabricator.wikimedia.org/T303031)
[15:42:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2014.codfw.wmnet with reason: host reimage
[15:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:58] <wikibugs>	 (03PS1) 10Jbond: puppet_compilers: bump to puppet-compiler version 2.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/769997
[15:44:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Andrew) > ** Easiest workaround, run `cloudsw1-c8-eqiad# deactivate vlans cloud-hosts1-eqiad forwarding-options dhcp-security opti...
[15:44:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compiler: add support for the netbox-hiera repo [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/769953 (owner: 10Jbond)
[15:44:47] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2014.codfw.wmnet with reason: host reimage
[15:44:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:16] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:45:25] <wikibugs>	 (03Merged) 10jenkins-bot: puppet_compiler: add support for the netbox-hiera repo [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/769953 (owner: 10Jbond)
[15:47:10] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:48:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compilers: bump to puppet-compiler version 2.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/769997 (owner: 10Jbond)
[15:49:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2014.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:49:56] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2014.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:54:56] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2014.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:55:22] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx)
[15:55:34] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "This seems like an improvement to me... are there any downsides to publishing these?" [puppet] - 10https://gerrit.wikimedia.org/r/766292 (owner: 10Majavah)
[15:56:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2014.codfw.wmnet with OS bullseye
[15:56:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:42] <wikibugs>	 (03PS1) 10AOkoth: vrts: rename mail module class variables [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942)
[15:59:54] <wikibugs>	 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10AndyRussG) Thanks so so much, @SCherukuwada, @jcrespo, @akosiaris, @Dzahn, @faidon , @MatthewVernon and everyone else who contributed to this! Hugely, hugely appreciated!!!!!! :) :)
[16:00:48] <wikibugs>	 (03PS2) 10Jbond: examples: add some example spec files [puppet] - 10https://gerrit.wikimedia.org/r/769972
[16:01:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] examples: add some example spec files [puppet] - 10https://gerrit.wikimedia.org/r/769972 (owner: 10Jbond)
[16:02:43] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:03:46] <wikibugs>	 (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34220/" [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[16:07:21] <wikibugs>	 (03PS3) 10Jbond: examples: add some example spec files [puppet] - 10https://gerrit.wikimedia.org/r/769972
[16:08:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] examples: add some example spec files [puppet] - 10https://gerrit.wikimedia.org/r/769972 (owner: 10Jbond)
[16:09:09] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:13:19] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ayounsi) Yeah it's totally harmless, the downside is that DHCP won't work on hosts directly connected to cloudsw1-c8-eqiad.  The u...
[16:13:25] <wikibugs>	 (03PS7) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[16:14:13] <icinga-wm>	 RECOVERY - SSH on analytics1067.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:16:36] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Check that xtrabackup --prepare is using the same version [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959) (owner: 10Jcrespo)
[16:18:24] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Downpref TI in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/769994 (owner: 10Ayounsi)
[16:19:35] <wikibugs>	 (03Abandoned) 10Jcrespo: Check for server version and compare with xtrabackup [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/678299 (https://phabricator.wikimedia.org/T253959) (owner: 10Palak199)
[16:27:58] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] mirrors: Raise ssl ciphersuite strength [puppet] - 10https://gerrit.wikimedia.org/r/769780 (owner: 10JHathaway)
[16:28:11] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] mirrors: use @resolve for syncproxy2.wna.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/769782 (owner: 10JHathaway)
[16:33:33] <wikibugs>	 (03PS4) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238)
[16:33:51] <wikibugs>	 (03CR) 10Phuedx: Request high-entropy Sec-CH-UA* client hints (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx)
[16:35:17] <wikibugs>	 (03PS1) 10Ssingh: certspotter: add -start_at_end to only fetch new logs [puppet] - 10https://gerrit.wikimedia.org/r/770000 (https://phabricator.wikimedia.org/T303593)
[16:35:59] <wikibugs>	 (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34223/console" [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus)
[16:37:02] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] "Per my question on IRC, this will be hosted on https://tools-static.wmflabs.org/admin/fingerprints/. I believe hosting the fingerprints pu" [puppet] - 10https://gerrit.wikimedia.org/r/766292 (owner: 10Majavah)
[16:38:30] <wikibugs>	 (03PS2) 10Ssingh: certspotter: add -start_at_end to only fetch new logs [puppet] - 10https://gerrit.wikimedia.org/r/770000 (https://phabricator.wikimedia.org/T303593)
[16:40:29] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34225/console" [puppet] - 10https://gerrit.wikimedia.org/r/770000 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh)
[16:45:17] <thcipriani>	 nemo-yiannis: thanks for the ping! looks like you got everything resolved.
[16:47:28] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10thcipriani) >>! In T303450#7769389, @Joe wrote: > @thcipriani I guess we need your approval for this.  Approved! Needed now that im...
[16:49:12] <wikibugs>	 (03PS1) 10Jbond: reposync: dont catch RepoSyncNoChangeError [software/spicerack] - 10https://gerrit.wikimedia.org/r/770003
[16:52:34] <wikibugs>	 (03PS18) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397)
[16:53:46] <wikibugs>	 (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond)
[16:54:36] <wikibugs>	 (03PS8) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[16:55:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond)
[16:57:11] <wikibugs>	 (03PS5) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238)
[16:58:03] <wikibugs>	 (03PS1) 10Btullis: Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599)
[16:58:28] <wikibugs>	 (03CR) 10Phuedx: Request high-entropy Sec-CH-UA* client hints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx)
[16:59:46] <wikibugs>	 (03PS19) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397)
[16:59:48] <wikibugs>	 (03CR) 10Phuedx: Request high-entropy Sec-CH-UA* client hints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx)
[17:02:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond)
[17:07:44] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: add -start_at_end to only fetch new logs [puppet] - 10https://gerrit.wikimedia.org/r/770000 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh)
[17:10:16] <wikibugs>	 (03PS2) 10Btullis: Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599)
[17:15:08] <wikibugs>	 (03PS3) 10Btullis: Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599)
[17:16:19] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34229/console" [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis)
[17:29:33] <wikibugs>	 (03PS1) 10Ssingh: certspotter: re-enable systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593)
[17:30:52] <wikibugs>	 10SRE, 10VPS-project-Codesearch, 10Patch-For-Review: Add operations/software/purged to Codesearch - https://phabricator.wikimedia.org/T303434 (10Ladsgroup) I leave it open for the rest of operations/software.
[17:31:59] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34230/console" [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh)
[17:32:42] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 04-1] "Do not merge before Monday." [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh)
[17:40:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[17:48:23] <wikibugs>	 (03PS4) 10DCausse: [wdqs] switch wdqs1010 to the streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108)
[17:56:08] <wikibugs>	 (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse)
[18:01:45] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Refactor envoy access_log_path to access loggers - https://phabricator.wikimedia.org/T303231 (10RLazarus)
[18:15:31] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "LGTM. Good to deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[18:18:28] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "Just in case: This is not verifiable via mwdebug so deployers should take care to look for errors in the main mediawiki-errors dashboard i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[18:50:17] <wikibugs>	 (PS1) Jcrespo: Improve logic and quality of life for remote backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/770021 (https://phabricator.wikimedia.org/T138562)
[18:55:29] <wikibugs>	 (PS2) Jcrespo: Improve logic and quality of life for remote backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/770021 (https://phabricator.wikimedia.org/T138562)
[19:29:34] <wikibugs>	 (PS3) Jcrespo: Improve logic and quality of life for remote backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/770021 (https://phabricator.wikimedia.org/T138562)
[19:35:41] <wikibugs>	 (PS1) Jcrespo: dbbackups: Migrate remote backup (snapshot) cmd line to 0.7 format [puppet] - https://gerrit.wikimedia.org/r/770023 (https://phabricator.wikimedia.org/T138562)
[19:36:37] <wikibugs>	 (CR) Jcrespo: [C: -1] "Wait for 0.7 wmfbackup-remote package deployment." [puppet] - https://gerrit.wikimedia.org/r/770023 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[21:10:49] <wikibugs>	 (CR) Herron: [C: +1] "Thanks for patching this!  LGTM" [puppet] - https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: Btullis)
[21:25:45] <wikibugs>	 SRE, ConfirmEdit (CAPTCHA extension), Platform Engineering, Wikimedia-Site-requests, and 3 others: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits - https://phabricator.wikimedia.org/T303433 (sbassett) Given  > ... it would be ideal if Stewards could enable this themselves, w...
[21:28:49] <wikibugs>	 (CR) Herron: "This change is ready for review." (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/768108 (https://phabricator.wikimedia.org/T302842) (owner: Herron)
[21:30:59] <wikibugs>	 SRE, ConfirmEdit (CAPTCHA extension), Platform Engineering, Wikimedia-Site-requests, and 3 others: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits - https://phabricator.wikimedia.org/T303433 (DannyS712) >>! In T303433#7770934, @sbassett wrote: > Given >  >> ... it would be id...
[21:40:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[22:21:13] <wikibugs>	 (CR) Cwhite: [C: +1] "Thanks for this!  :rocket:" [puppet] - https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: Btullis)
[22:33:56] <icinga-wm>	 PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:40:14] <wikibugs>	 (CR) Andrew Bogott: [C: +1] P:toolforge::static: publish SSH fingerprints under /admin (1 comment) [puppet] - https://gerrit.wikimedia.org/r/766292 (owner: Majavah)
[22:45:27] <wikibugs>	 SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (colewhite) >>! In T300130#7769545, @elukey wrote: > @colewhite I had a chat with John and the only current supported way is to have a self-hosted puppet master in the clou...
[23:29:22] <wikibugs>	 (CR) Volans: Fix the prometheus elasticsearch exporter on bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: Btullis)
[23:35:34] <icinga-wm>	 RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook