[00:01:15] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [00:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:08] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [00:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:47] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:09] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:11] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:35] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:40] (03CR) 10Cwhite: [C: 03+2] hiera: add pki to logging env [puppet] - 10https://gerrit.wikimedia.org/r/769711 (https://phabricator.wikimedia.org/T300130) (owner: 10Cwhite) [00:22:41] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:27:57] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:28:07] (03PS1) 10Cwhite: Revert "hiera: add pki to logging env" [puppet] - 10https://gerrit.wikimedia.org/r/769563 [00:31:04] (03CR) 10Cwhite: [C: 03+2] Revert "hiera: add pki to logging env" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769563 (owner: 10Cwhite) [00:33:17] !log on 
mwmaint1002 running populateGlobalEditCount.php [00:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:05] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:41:12] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/769563 | Reverted ]] due to puppet failures: # I think the cloud puppetmaster doesn't have a cert at `... [00:42:51] RECOVERY - BGP status on cr2-esams is OK: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:15:19] (03PS1) 10RLazarus: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) [01:25:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:40:14] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [02:04:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [02:09:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency 
of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [02:12:19] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:26:07] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:34:09] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:34:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:36:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:07:23] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:29:22] (03PS1) 10Legoktm: Revert "Cache Badtitle 400s for 60s in varnish-fe" [puppet] - 10https://gerrit.wikimedia.org/r/769827 [03:41:29] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:44:09] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:01:03] PROBLEM - BGP status on cr2-esams 
is CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:02:53] PROBLEM - Recursive DNS on 208.80.153.111 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [04:05:41] RECOVERY - Recursive DNS on 208.80.153.111 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [05:29:25] RECOVERY - Check systemd state on datahubsearch1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:37:45] PROBLEM - Check systemd state on datahubsearch1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:45:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P22360 and previous config saved to /var/cache/conftool/dbconfig/20220311-054514-marostegui.json [05:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22361 and previous config saved to /var/cache/conftool/dbconfig/20220311-055409-root.json [05:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22362 and previous config saved to /var/cache/conftool/dbconfig/20220311-060913-root.json [06:09:15] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:14] (03CR) 10Marostegui: [C: 03+2] wmnet: Switchover m2-master [dns] - 10https://gerrit.wikimedia.org/r/769708 (owner: 10Marostegui) [06:13:40] !log Reboot dbproxy1014 T303174 [06:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:39] (03PS1) 10Marostegui: Revert "db1099: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/769828 [06:16:07] (03PS1) 10Marostegui: Revert "wmnet: Failover m1-master" [dns] - 10https://gerrit.wikimedia.org/r/769829 [06:16:16] (03CR) 10Marostegui: [C: 03+2] Revert "db1099: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/769828 (owner: 10Marostegui) [06:16:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [06:21:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [06:24:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22363 and previous config saved to /var/cache/conftool/dbconfig/20220311-062417-root.json [06:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:25] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:39:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 100%: After schema change', diff 
saved to https://phabricator.wikimedia.org/P22364 and previous config saved to /var/cache/conftool/dbconfig/20220311-063921-root.json [06:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:05] <_joe_> uh not sure if it's connected to the cr2-esams issue [07:09:13] <_joe_> but I can't reach gerrit via ipv6 [07:09:35] <_joe_> marostegui: can you reach gerrit rn? [07:14:30] RECOVERY - BGP status on cr2-esams is OK: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:16:17] works for me _joe_ [07:20:57] <_joe_> marostegui: yeah for me too after the recovery [07:34:08] (03PS10) 10Giuseppe Lavagetto: C:varnish: Load public-clouds.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [07:34:10] (03PS6) 10Giuseppe Lavagetto: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769511 (owner: 10Jbond) [07:34:12] (03PS1) 10Giuseppe Lavagetto: P:cache::base: add netmapper file for abuse networks [puppet] - 10https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471) [07:34:14] (03PS1) 10Giuseppe Lavagetto: C:varnish: load abuse_networks.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769900 (https://phabricator.wikimedia.org/T302471) [07:34:16] (03PS1) 10Giuseppe Lavagetto: C:varnish: introduce the X-Abuse-Network request "header" [puppet] - 10https://gerrit.wikimedia.org/r/769901 (https://phabricator.wikimedia.org/T302471) [07:34:18] (03PS1) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 [07:53:34] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for uwsgi-netbox-scriptproxy [puppet] - 10https://gerrit.wikimedia.org/r/767834 (https://phabricator.wikimedia.org/T135991) (owner: 
10Muehlenhoff) [07:55:52] (03PS19) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [07:58:33] (03PS2) 10Jcrespo: Add Cumin alias for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/769731 (owner: 10Muehlenhoff) [07:58:42] (03PS3) 10Jcrespo: Add Cumin alias for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/769731 (owner: 10Muehlenhoff) [07:59:30] (03PS4) 10Jcrespo: Add Cumin alias for mediabackups worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/769731 (owner: 10Muehlenhoff) [07:59:52] (03PS5) 10Jcrespo: Add Cumin alias for mediabackup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/769731 (owner: 10Muehlenhoff) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220311T0800) [08:01:52] (03CR) 10Jcrespo: [C: 03+2] Add Cumin alias for mediabackup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/769731 (owner: 10Muehlenhoff) [08:17:02] (03PS1) 10Muehlenhoff: Add profile::java to role::builder to install JDK 8/11 [puppet] - 10https://gerrit.wikimedia.org/r/769908 [08:18:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769908 (owner: 10Muehlenhoff) [08:19:08] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye [08:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:13] !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host cloudvirt1017.eqiad.wmnet [08:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:42] (03CR) 10Muehlenhoff: [C: 03+2] Add profile::java to role::builder to install JDK 8/11 [puppet] - 10https://gerrit.wikimedia.org/r/769908 (owner: 10Muehlenhoff) [08:23:31] !log 
ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudvirt1017.eqiad.wmnet [08:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:36] (03PS1) 10Muehlenhoff: Also enable component/jdk8 for bullseye, also present there [puppet] - 10https://gerrit.wikimedia.org/r/769909 [08:30:40] !log upgrade and restart db1145 [08:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:24] (03PS2) 10Muehlenhoff: Also enable component/jdk8 for bullseye, also present there [puppet] - 10https://gerrit.wikimedia.org/r/769909 [08:38:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769909 (owner: 10Muehlenhoff) [08:40:37] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ayounsi) I spent some time on cloudvirt1017 yesterday, I was able to confirm that: * When on the live host, with tcpdump, `sudo dh... 
[08:41:44] !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host cloudvirt1017.eqiad.wmnet [08:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:00] !log upgrade and restart db2139 [08:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:21] (03CR) 10Muehlenhoff: [C: 03+2] Also enable component/jdk8 for bullseye, also present there [puppet] - 10https://gerrit.wikimedia.org/r/769909 (owner: 10Muehlenhoff) [08:43:47] !log restarting blazegraph on wdqs1012 (jvm stuck for 5hours) [08:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:02] (03PS1) 10Muehlenhoff: Revert "Also enable component/jdk8 for bullseye, also present there" [puppet] - 10https://gerrit.wikimedia.org/r/769910 [08:46:00] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Also enable component/jdk8 for bullseye, also present there" [puppet] - 10https://gerrit.wikimedia.org/r/769910 (owner: 10Muehlenhoff) [08:50:40] PROBLEM - MD RAID on ganeti2013 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 2, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:50:41] ACKNOWLEDGEMENT - MD RAID on ganeti2013 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T303585 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:50:44] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T303585 (10ops-monitoring-bot) [08:51:34] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: cleanup service unit file [puppet] - 10https://gerrit.wikimedia.org/r/769737 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [08:51:43] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host 
cloudvirt1017.eqiad.wmnet [08:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:19] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [08:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:26] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [08:55:59] hmm... [08:57:44] (03CR) 10Vgutierrez: [C: 03+1] C:varnish: Load public-clouds.json via netmapper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [09:00:05] (03CR) 10Vgutierrez: [C: 03+1] C:varnish: use X-Public-Cloud to store the cloud provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769511 (owner: 10Jbond) [09:00:26] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [09:00:51] !log kubernetes2011:~# systemctl restart rsyslog.service - T289766 [09:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:55] T289766: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 [09:01:22] jayme: I am about to send a code change to reimage the node :D [09:15:16] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye [09:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:30] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [09:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:36] 10Puppet, 
10Infrastructure-Foundations, 10observability, 10cloud-services-team (Kanban): 2 systemctl services failing on cloudcontrol hosts: prometheus-openstack-exporter and logrotate - https://phabricator.wikimedia.org/T303511 (10aborrero) [09:27:42] 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10Joe) p:05Triage→03High @thcipriani I guess we need your approval for this. [09:29:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [09:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [09:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:46] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye [09:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:01] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [09:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:42:06] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10dcaro) I'm not very knowledgeable in this subject, so I'm probably not making sense but, some things that are not clear to me xd *... 
[09:42:06] PROBLEM - Device not healthy -SMART- on ganeti2013 is CRITICAL: cluster=ganeti device=sdb instance=ganeti2013 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ganeti2013&var-datasource=codfw+prometheus/ops [09:42:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [09:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [09:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:29] !log stopping certspotter on alert1001 [09:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:51] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ayounsi) Some progress: `name=install1003 DHCP discover,lines=20 09:19:22.791939 IP (tos 0x0, ttl 64, id 14579, offset 0, flags [n... 
[09:52:52] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:25] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1017.eqiad.wmnet with OS bullseye [09:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:52] (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes2011 [puppet] - 10https://gerrit.wikimedia.org/r/769919 (https://phabricator.wikimedia.org/T300744) [09:57:54] (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes2012 [puppet] - 10https://gerrit.wikimedia.org/r/769920 (https://phabricator.wikimedia.org/T300744) [09:57:58] jayme: --^ all yours :) [10:00:45] (03CR) 10JMeybohm: [C: 03+1] Set bullseye + overlayfs for kubernetes2011 [puppet] - 10https://gerrit.wikimedia.org/r/769919 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:01:18] (03CR) 10JMeybohm: [C: 03+1] Set bullseye + overlayfs for kubernetes2012 [puppet] - 10https://gerrit.wikimedia.org/r/769920 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:01:25] I really like those! [10:02:19] \o/ thanks! 
[10:02:25] going to prep for the 2011's reimage [10:02:28] 10SRE, 10Traffic-Icebox: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (10Vgutierrez) [10:03:50] !log manually installed jvmquake to wdqs1010 (test machine) from https://people.wikimedia.org/~jmm/jvmquake/ [10:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:03] (03PS1) 10Jbond: C:varnish: drop carries netmapper config [puppet] - 10https://gerrit.wikimedia.org/r/769927 [10:04:25] (03PS1) 10Vgutierrez: certspotter: Temporarily disable certspotter [puppet] - 10https://gerrit.wikimedia.org/r/769928 (https://phabricator.wikimedia.org/T303593) [10:04:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [10:04:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [10:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance [10:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance [10:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:08] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs for kubernetes2011 [puppet] - 10https://gerrit.wikimedia.org/r/769919 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:06:30] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34204/console" [puppet] - 
10https://gerrit.wikimedia.org/r/769928 (https://phabricator.wikimedia.org/T303593) (owner: 10Vgutierrez) [10:07:44] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] certspotter: Temporarily disable certspotter [puppet] - 10https://gerrit.wikimedia.org/r/769928 (https://phabricator.wikimedia.org/T303593) (owner: 10Vgutierrez) [10:08:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] P:tcpircbot: cleanup allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/768662 (owner: 10Majavah) [10:09:01] (03CR) 10Alexandros Kosiaris: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/768662 (owner: 10Majavah) [10:09:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [10:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:35] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2011.codfw.wmnet with OS bullseye [10:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Make k8s-ingress-wikikube page [puppet] - 10https://gerrit.wikimedia.org/r/767078 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:13:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Please also fix the docker images if they still need to 😊" [puppet] - 10https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) (owner: 10JMeybohm) [10:14:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [10:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:56] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:14:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - 
kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:15:11] this is me reimaging kubernetes2011 --^ [10:16:26] (KubernetesCalicoDown) firing: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:16:29] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. [10:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2110.codfw.wmnet with OS bullseye [10:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:55] (03CR) 10Jbond: C:varnish: Load public-clouds.json via netmapper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [10:19:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [10:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:13] (03PS2) 10Phuedx: Remove unused wgWMESearchRelevancePages config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769748 [10:21:10] (03PS7) 10Jbond: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769511 [10:21:30] (03PS9) 10Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391) [10:22:47] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [10:23:15] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/769900 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [10:24:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] 
- 10https://gerrit.wikimedia.org/r/769901 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [10:24:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [10:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:49] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2011.codfw.wmnet with reason: host reimage [10:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:34] !log disable certspotter - T303593 [10:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:37] T303593: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 [10:26:41] (KubernetesCalicoDown) resolved: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:26:48] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (10Vgutierrez) p:05Triage→03Medium [10:28:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2011.codfw.wmnet with reason: host reimage [10:28:11] (KubernetesCalicoDown) firing: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2110.codfw.wmnet with reason: host reimage [10:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:11] (KubernetesCalicoDown) resolved: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:33:46] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) @colewhite I had a chat with John and the only current supported way is to have a self-hosted puppet master in the cloud project, so I am wondering if this is some... [10:34:11] (KubernetesCalicoDown) firing: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:34:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2110.codfw.wmnet with reason: host reimage [10:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. [10:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:00] (03CR) 10Jbond: "there is also phabricator_abusers which is used in misc-frontend[1]" [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto) [10:38:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ayounsi) @Jclark-ctr let's hold on putting public hosts in the new rows for now. So ideally those would go to A-D. [10:39:01] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. 
[10:39:03] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:11] (KubernetesCalicoDown) resolved: kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[10:40:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2011.codfw.wmnet with OS bullseye
[10:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:05] (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes2012 [puppet] - https://gerrit.wikimedia.org/r/769920 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[10:46:56] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2012.codfw.wmnet with OS bullseye
[10:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2110.codfw.wmnet with OS bullseye
[10:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:26] (KubernetesCalicoDown) firing: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[10:58:26] (KubernetesCalicoDown) firing: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[10:58:41] (KubernetesCalicoDown) resolved: (2) kubernetes2011.codfw.wmnet:9091 is not running 
calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[10:59:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons.
[10:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:05] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2012.codfw.wmnet with reason: host reimage
[11:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:26] (KubernetesCalicoDown) firing: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[11:03:41] (KubernetesCalicoDown) resolved: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[11:05:11] (KubernetesCalicoDown) firing: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[11:05:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2012.codfw.wmnet with reason: host reimage
[11:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:06] SRE, ops-eqiad, Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (elukey)
[11:08:11] (CR) Jbond: [C: -1] "See inline for nits, -1 is just for the leftover merge artefacts" [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:09:16] SRE, ops-eqiad, Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (BTullis) Yes, I'm happy to shut down these 
nodes whenever @Cmjohnson prefers.
[11:10:11] (KubernetesCalicoDown) resolved: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[11:10:30] (PS25) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[11:10:48] (CR) Jbond: [C: -1] varnish/frontend: consume etcd data for dynamic banning of requests. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:11:11] (KubernetesCalicoDown) firing: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[11:11:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[11:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:24] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:13:30] (CR) Jbond: [C: +1] cache: turn on dynamic bans on all of eqsin [puppet] - https://gerrit.wikimedia.org/r/769389 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:13:37] (CR) Jbond: [C: +1] cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[11:13:40] (PS1) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[11:13:45] (PS1) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) 
[11:13:51] (PS1) MVernon: swift::ring: deploy by tarball not individual files [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117)
[11:14:13] (CR) jerkins-bot: [V: -1] swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:14:29] (CR) jerkins-bot: [V: -1] swift::ring: deploy by tarball not individual files [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:15:16] (CR) jerkins-bot: [V: -1] puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:16:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet
[11:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:40] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 133, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:18:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2012.codfw.wmnet with OS bullseye
[11:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:26] (KubernetesCalicoDown) resolved: kubernetes2012.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[11:19:09] (PS2) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[11:19:20] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 
(https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[11:21:33] (PS2) MVernon: swift::ring: deploy by tarball not individual files [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117)
[11:26:50] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) So there's multiple issues here: According to the Dell support matrix : "H750/HBA350i/HBA355e require 20.04.2 minimum" https://linux.dell.com/files/supportmatrix/Ubuntu_LTS_Support_Ma...
[11:26:52] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH
[11:28:46] SRE, Data-Catalog, Data-Engineering, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) I've updated the helm charts for datahub so that the secrets handling is compatible with our puppet based secret handling method. T... 
[11:32:47] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769780 (owner: JHathaway)
[11:33:08] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769782 (owner: JHathaway)
[11:33:47] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:36:53] (PS2) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[11:38:40] (PS3) Muehlenhoff: Remove cumin2001 from Puppet [puppet] - https://gerrit.wikimedia.org/r/769712 (https://phabricator.wikimedia.org/T303399)
[11:40:03] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:40:26] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:41:00] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:42:19] SRE, Data-Catalog, Data-Engineering, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis)
[11:44:19] SRE, Data-Catalog, Data-Engineering, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) I've updated the diagram to clarify the way that traffic is intended to flow within the deployment - i.e. requests to the GMS do not... 
[11:47:38] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (MoritzMuehlenhoff)
[11:48:17] SRE, Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (MoritzMuehlenhoff) Open→Resolved cumin2002 is the active Cumin host in codfw, decommission of cumin2001 happens via https://phabricator.wikimedia.org/T303399
[11:51:54] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cumin2001.codfw.wmnet
[11:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:35] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/769780 (owner: JHathaway)
[11:55:58] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:55] (PS3) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[11:58:25] (CR) jerkins-bot: [V: -1] swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:59:52] (PS4) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[12:00:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:38] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[12:03:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cumin2001.codfw.wmnet
[12:03:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:16] (CR) Cathal Mooney: [C: +2] Add several ASNs to those that alert as critical from Icinga [puppet] - https://gerrit.wikimedia.org/r/767862 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[12:11:10] (PS1) Cathal Mooney: Add new QFX switches in Eqiad row E/F to rancid for config backup [puppet] - https://gerrit.wikimedia.org/r/769950 (https://phabricator.wikimedia.org/T299758)
[12:15:05] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (cmooney) @MatthewVernon apologies for the late reply, I've been only working part-time the last few days as I'd been ill. I think it is fine to proceed, but...
[12:22:27] (CR) Ayounsi: [C: +1] Add new QFX switches in Eqiad row E/F to rancid for config backup [puppet] - https://gerrit.wikimedia.org/r/769950 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[12:25:30] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis) I'll carry out this work. I can also confirm that Njideka is WMF staff on the Data Engineering team and that she requires these pri... 
[12:27:22] (CR) Cathal Mooney: [C: +2] Add new QFX switches in Eqiad row E/F to rancid for config backup [puppet] - https://gerrit.wikimedia.org/r/769950 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[12:27:25] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis)
[12:31:24] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis) p:Triage→Medium
[12:41:13] (PS3) MVernon: swift::ring: deploy by tarball not individual files [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117)
[12:42:00] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[12:42:33] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis)
[12:47:03] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis) Technically, the procedure states that I'm also supposed to wait for @Ottomata to approve, although @Milimetric has also been approv... 
[12:48:27] (PS1) Jbond: puppet_compiler: add support for the netbox-hiera repo [software/puppet-compiler] - https://gerrit.wikimedia.org/r/769953
[12:48:29] (PS6) Jcrespo: Check that xtrabackup --prepare is using the same version [software/wmfbackups] - https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959)
[12:56:56] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (Marostegui) @MoritzMuehlenhoff @Volans I guess ^ means we also need to replace our raid monitoring tools?
[13:03:39] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) >>! In T297913#7769796, @Marostegui wrote: > @MoritzMuehlenhoff @Volans I guess ^ means we also need to replace our raid monitoring tools? Yes, our monitoring calls the megacli binary...
[13:04:06] (CR) Jcrespo: Check that xtrabackup --prepare is using the same version (1 comment) [software/wmfbackups] - https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959) (owner: Jcrespo)
[13:15:35] (PS2) Marostegui: Revert "wmnet: Failover m1-master" [dns] - https://gerrit.wikimedia.org/r/769829
[13:16:55] (CR) Marostegui: [C: +2] Revert "wmnet: Failover m1-master" [dns] - https://gerrit.wikimedia.org/r/769829 (owner: Marostegui)
[13:19:16] (PS1) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[13:19:55] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (ayounsi)
[13:21:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul 
https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[13:24:34] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[13:25:42] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (BTullis) `cross-validate-accounts` exits without error. ` btullis@mwmaint1002:~$ cross-validate-accounts --username nokafor --uid 38462 --ema...
[13:28:22] (PS1) Btullis: Enable production shell access for Njideka Okafor [puppet] - https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516)
[13:29:37] (CR) jerkins-bot: [V: -1] Enable production shell access for Njideka Okafor [puppet] - https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: Btullis)
[13:30:47] (PS2) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[13:30:49] (PS2) Btullis: Enable production shell access for Njideka Okafor [puppet] - https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516)
[13:33:30] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (ayounsi) @Dale_Zhou I can't find any Wikitech user with the name "Dale_Zhou" see the instructions on https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer... 
[13:33:32] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (ayounsi) a:Dale_Zhou
[13:34:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123', diff saved to https://phabricator.wikimedia.org/P22366 and previous config saved to /var/cache/conftool/dbconfig/20220311-133407-marostegui.json
[13:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:30] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (ayounsi)
[13:36:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 1%: After reboot', diff saved to https://phabricator.wikimedia.org/P22367 and previous config saved to /var/cache/conftool/dbconfig/20220311-133633-root.json
[13:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[13:43:41] !log update pcc facts
[13:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:38] (PS1) Jbond: examples: add some example spec files [puppet] - https://gerrit.wikimedia.org/r/769972
[13:46:21] (CR) jerkins-bot: [V: -1] examples: add some example spec files [puppet] - https://gerrit.wikimedia.org/r/769972 (owner: Jbond)
[13:48:06] (PS1) Ayounsi: Add shubhankar to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/769973 (https://phabricator.wikimedia.org/T303032)
[13:49:25] !log dbmaint on s1@eqiad T298294
[13:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:29] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - 
https://phabricator.wikimedia.org/T298294
[13:49:39] !log dbmaint on s8@eqiad T300775
[13:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:42] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[13:51:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 5%: After reboot', diff saved to https://phabricator.wikimedia.org/P22368 and previous config saved to /var/cache/conftool/dbconfig/20220311-135137-root.json
[13:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:54] (PS1) Giuseppe Lavagetto: varnish: allow injecting rate-limiting rules from hiera [puppet] - https://gerrit.wikimedia.org/r/769975
[13:56:20] (PS2) Giuseppe Lavagetto: varnish: allow injecting rate-limiting rules from hiera [puppet] - https://gerrit.wikimedia.org/r/769975
[13:56:23] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/769973 (https://phabricator.wikimedia.org/T303032) (owner: Ayounsi)
[13:56:43] (CR) Ayounsi: [C: +2] Add shubhankar to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/769973 (https://phabricator.wikimedia.org/T303032) (owner: Ayounsi)
[13:57:27] (CR) Ayounsi: [C: +2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/769973 (https://phabricator.wikimedia.org/T303032) (owner: Ayounsi)
[14:03:29] (PS1) Giuseppe Lavagetto: Add stub data for fe_ratelimit injection [labs/private] - https://gerrit.wikimedia.org/r/769977 (https://phabricator.wikimedia.org/T303534)
[14:03:38] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (ayounsi) Open→Resolved a:ayounsi @ShubhankarP you should now have access, The doc on https://wikitech.wikimedia.org/wiki/SRE/Productio... 
[14:04:29] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add stub data for fe_ratelimit injection [labs/private] - https://gerrit.wikimedia.org/r/769977 (https://phabricator.wikimedia.org/T303534) (owner: Giuseppe Lavagetto)
[14:04:59] (PS1) Marostegui: Revert "db1143: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769836
[14:05:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1170:3317', diff saved to https://phabricator.wikimedia.org/P22369 and previous config saved to /var/cache/conftool/dbconfig/20220311-140549-marostegui.json
[14:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:02] (CR) Marostegui: [C: +2] Revert "db1143: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769836 (owner: Marostegui)
[14:06:37] (PS1) Elukey: Set bullseye + overlayfs settings for kubernetes2013 [puppet] - https://gerrit.wikimedia.org/r/769978 (https://phabricator.wikimedia.org/T300744)
[14:06:39] (PS1) Elukey: Set bullseye + overlayfs settings for kubernetes2014 [puppet] - https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744)
[14:06:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22370 and previous config saved to /var/cache/conftool/dbconfig/20220311-140641-root.json
[14:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:37] (PS1) Marostegui: Revert "db1144: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769837
[14:07:51] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34211/console" [puppet] - https://gerrit.wikimedia.org/r/769975 (owner: Giuseppe Lavagetto)
[14:08:24] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:25] (CR) Marostegui: [C: +2] Revert "db1144: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769837 (owner: Marostegui)
[14:09:09] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34212/console" [puppet] - https://gerrit.wikimedia.org/r/769975 (owner: Giuseppe Lavagetto)
[14:09:23] (PS1) Marostegui: Revert "db1142: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769838
[14:09:44] (PS1) Marostegui: Revert "db1141: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769839
[14:10:04] (CR) Marostegui: [C: +2] Revert "db1142: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769838 (owner: Marostegui)
[14:10:29] (CR) Marostegui: [C: +2] Revert "db1141: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769839 (owner: Marostegui)
[14:11:30] (PS1) Marostegui: Revert "db2126,db2095: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769840
[14:12:31] (CR) Marostegui: [C: +2] Revert "db2126,db2095: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769840 (owner: Marostegui)
[14:13:05] (CR) Jbond: [C: +2] motd::message: add new define for simple motd entries [puppet] - https://gerrit.wikimedia.org/r/765265 (owner: Jbond)
[14:14:00] (PS3) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[14:21:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P22371 and previous config saved to /var/cache/conftool/dbconfig/20220311-142147-root.json
[14:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:22:53] (PS1) Jbond: P:base::production: Add profile::netbox::host [puppet] - https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[14:23:24] (CR) jerkins-bot: [V: -1] P:base::production: Add profile::netbox::host [puppet] - https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[14:25:01] (PS4) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[14:27:40] (PS2) Jbond: P:base::production: Add profile::netbox::host [puppet] - https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[14:27:45] (PS5) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[14:28:04] (PS3) Giuseppe Lavagetto: varnish: allow injecting rate-limiting rules from hiera [puppet] - https://gerrit.wikimedia.org/r/769975 (https://phabricator.wikimedia.org/T303534)
[14:29:00] (CR) JMeybohm: [C: +1] Set bullseye + overlayfs settings for kubernetes2013 [puppet] - https://gerrit.wikimedia.org/r/769978 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[14:29:14] (CR) JMeybohm: [C: +1] Set bullseye + overlayfs settings for kubernetes2014 [puppet] - https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[14:30:44] (CR) Elukey: [C: +2] Set bullseye + overlayfs settings for kubernetes2013 [puppet] - https://gerrit.wikimedia.org/r/769978 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[14:35:42] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2013.codfw.wmnet with OS bullseye
[14:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 
(re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P22372 and previous config saved to /var/cache/conftool/dbconfig/20220311-143652-root.json
[14:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:12] (PS6) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[14:40:59] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:42:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:42:21] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34218/console" [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[14:43:26] (KubernetesCalicoDown) firing: kubernetes2013.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[14:45:43] SRE, SRE-Access-Requests, Data-Engineering-Kanban, Patch-For-Review: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (Milimetric) I think Andrew just delegated to me for when he's out of town, but I approve!
[14:48:09] (PS1) Jgiannelos: Revert "mobileapps: Bump to 2022-03-10-175759-production" [deployment-charts] - https://gerrit.wikimedia.org/r/769841
[14:49:46] (CR) Vgutierrez: "overall LGTM (just fix the syntax error). 
It would be great if tests could be provided as part of modules/varnish/files/tests/text/09-anal" [puppet] - https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: Phuedx)
[14:51:07] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2013.codfw.wmnet with reason: host reimage
[14:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:19] Hi! Yesterday's deployment on mobileapps broke android on production. Can we deploy out of schedule today?
[14:51:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P22373 and previous config saved to /var/cache/conftool/dbconfig/20220311-145159-root.json
[14:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:53] cc thcipriani here is the revert patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/769841
[14:54:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2013.codfw.wmnet with reason: host reimage
[14:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:28] !log cr2-esams AVOID-PATHS as-path TI "6762 .*"
[14:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:35] !log cr2-esams AVOID-PATHS as-path TI "6762 .*" <- rolled back
[15:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:43] !log cr1/2-eqiad AVOID-PATHS as-path TI "6762 .*"
[15:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:33] (PS3) Jbond: P:base::production: Add profile::netbox::host [puppet] - https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[15:05:35] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:41] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:06:50] (03PS4) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) [15:07:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P22374 and previous config saved to /var/cache/conftool/dbconfig/20220311-150702-root.json [15:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:22] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host kubernetes2013.codfw.wmnet with OS bullseye [15:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:26] (KubernetesCalicoDown) resolved: kubernetes2013.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:12:14] uff failed? 
[15:12:53] ah Failed to get Netbox script results, try manually [15:13:11] PROBLEM - SSH on analytics1067.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:13:34] and https://netbox.wikimedia.org/api/extras/job-results/2654017/ doesn't look good [15:13:57] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:47] (03PS2) 10Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - 10https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) [15:15:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10Dale_Zhou) [15:16:22] (03CR) 10CDanis: [C: 03+1] "😅" [puppet] - 10https://gerrit.wikimedia.org/r/769975 (https://phabricator.wikimedia.org/T303534) (owner: 10Giuseppe Lavagetto) [15:17:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: allow injecting rate-limiting rules from hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769975 (https://phabricator.wikimedia.org/T303534) (owner: 10Giuseppe Lavagetto) [15:18:58] (03PS5) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) [15:19:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10Dale_Zhou) >>! In T303031#7769858, @ayounsi wrote: > @Dale_Zhou I can't find any Wikitech user with the name "Dale_Zhou" see the instructions on https://wikitech.wikimedi... [15:20:34] nemo-yiannis: o/ if the issue is user facing and impacts users right now, I think that a deploy makes sense. 
[15:20:50] do you have all sign-offs for the revert? [15:21:46] ah it is a revert in the docker image, yeah I think it is fine in my opinion [15:21:56] Yeah its introducing some UI changes that don't work on android. I can merge the patch. [15:22:05] maybe it would be good to get a +1 from somebody else [15:22:25] sure [15:23:49] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs settings for kubernetes2014 [puppet] - 10https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:23:51] (03PS6) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) [15:23:53] (03PS1) 10Jbond: motd::message: use the title as the message by default [puppet] - 10https://gerrit.wikimedia.org/r/769992 [15:23:55] (03PS2) 10Elukey: Set bullseye + overlayfs settings for kubernetes2014 [puppet] - 10https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744) [15:23:57] (03CR) 10Elukey: [V: 03+2 C: 03+2] Set bullseye + overlayfs settings for kubernetes2014 [puppet] - 10https://gerrit.wikimedia.org/r/769979 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:24:00] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:53] (03CR) 10jerkins-bot: [V: 04-1] motd::message: use the title as the message by default [puppet] - 10https://gerrit.wikimedia.org/r/769992 (owner: 10Jbond) [15:24:55] (03PS1) 10Btullis: Allow access to MariaDB analytics-meta from Kubernetes pods [puppet] - 10https://gerrit.wikimedia.org/r/769993 (https://phabricator.wikimedia.org/T303049) [15:26:00] (03PS2) 10Jbond: motd::message: use the title as the message by default [puppet] - 10https://gerrit.wikimedia.org/r/769992 [15:26:07] (03PS7) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 
(https://phabricator.wikimedia.org/T229397) [15:27:08] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2014.codfw.wmnet with OS bullseye [15:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:38] (03CR) 10Isabelle Hurbain-Palatin: [C: 03+1] "I checked that this is indeed a revert of the indicated commit, and that the indicated commit is the previous one in the chain, which woul" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769841 (owner: 10Jgiannelos) [15:27:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] Allow access to MariaDB analytics-meta from Kubernetes pods [puppet] - 10https://gerrit.wikimedia.org/r/769993 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:30:40] (03CR) 10Jgiannelos: [C: 03+2] Revert "mobileapps: Bump to 2022-03-10-175759-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769841 (owner: 10Jgiannelos) [15:31:15] (03CR) 10Jbond: [C: 03+2] motd::message: use the title as the message by default [puppet] - 10https://gerrit.wikimedia.org/r/769992 (owner: 10Jbond) [15:32:06] elukey: thanks, deploying now [15:32:18] (03CR) 10Btullis: [C: 03+2] Allow access to MariaDB analytics-meta from Kubernetes pods [puppet] - 10https://gerrit.wikimedia.org/r/769993 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:33:02] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:05] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:26] (KubernetesCalicoDown) firing: kubernetes2014.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:34:56] (03Merged) 10jenkins-bot: Revert "mobileapps: Bump to 
2022-03-10-175759-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769841 (owner: 10Jgiannelos) [15:35:04] (03PS1) 10Ayounsi: Downpref TI in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/769994 [15:35:51] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:17] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:24] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:36:59] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:45] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:41] (03PS2) 10Jbond: puppet_compiler: add support for the netbox-hiera repo [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/769953 [15:38:49] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:43] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:14] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:41:32] (03PS1) 10Ayounsi: Add dalezhou to analytics-privatedata-users 
[puppet] - 10https://gerrit.wikimedia.org/r/769996 (https://phabricator.wikimedia.org/T303031) [15:42:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2014.codfw.wmnet with reason: host reimage [15:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:58] (03PS1) 10Jbond: puppet_compilers: bump to puppet-compiler version 2.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/769997 [15:44:04] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Andrew) > ** Easiest workaround, run `cloudsw1-c8-eqiad# deactivate vlans cloud-hosts1-eqiad forwarding-options dhcp-security opti... [15:44:09] (03CR) 10Jbond: [C: 03+2] puppet_compiler: add support for the netbox-hiera repo [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/769953 (owner: 10Jbond) [15:44:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2014.codfw.wmnet with reason: host reimage [15:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:16] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:25] (03Merged) 10jenkins-bot: puppet_compiler: add support for the netbox-hiera repo [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/769953 (owner: 10Jbond) [15:47:10] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:48:39] (03CR) 10Jbond: [C: 03+2] puppet_compilers: bump to puppet-compiler version 2.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/769997 (owner: 10Jbond) [15:49:26] (KubernetesCalicoDown) resolved: kubernetes2014.codfw.wmnet:9091 
is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:49:56] (KubernetesCalicoDown) firing: kubernetes2014.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:54:56] (KubernetesCalicoDown) resolved: kubernetes2014.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:55:22] (03CR) 10Vgutierrez: [C: 04-1] Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [15:55:34] (03CR) 10Andrew Bogott: [C: 03+1] "This seems like an improvement to me... are there any downsides to publishing these?" [puppet] - 10https://gerrit.wikimedia.org/r/766292 (owner: 10Majavah) [15:56:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2014.codfw.wmnet with OS bullseye [15:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:42] (03PS1) 10AOkoth: vrts: rename mail module class variables [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) [15:59:54] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10AndyRussG) Thanks so so much, @SCherukuwada, @jcrespo, @akosiaris, @Dzahn, @faidon , @MatthewVernon and everyone else who contributed to this! Hugely, hugely appreciated!!!!!! 
:) :) [16:00:48] (03PS2) 10Jbond: examples: add some example spec files [puppet] - 10https://gerrit.wikimedia.org/r/769972 [16:01:27] (03CR) 10jerkins-bot: [V: 04-1] examples: add some example spec files [puppet] - 10https://gerrit.wikimedia.org/r/769972 (owner: 10Jbond) [16:02:43] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:03:46] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34220/" [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [16:07:21] (03PS3) 10Jbond: examples: add some example spec files [puppet] - 10https://gerrit.wikimedia.org/r/769972 [16:08:52] (03CR) 10Jbond: [C: 03+2] examples: add some example spec files [puppet] - 10https://gerrit.wikimedia.org/r/769972 (owner: 10Jbond) [16:09:09] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:13:19] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ayounsi) Yeah it's totally harmless, the downside is that DHCP won't work on hosts directly connected to cloudsw1-c8-eqiad. The u... 
[16:13:25] (03PS7) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) [16:14:13] RECOVERY - SSH on analytics1067.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:16:36] (03CR) 10Jcrespo: [C: 03+2] Check that xtrabackup --prepare is using the same version [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959) (owner: 10Jcrespo) [16:18:24] (03CR) 10Ayounsi: [C: 03+2] Downpref TI in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/769994 (owner: 10Ayounsi) [16:19:35] (03Abandoned) 10Jcrespo: Check for server version and compare with xtrabackup [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/678299 (https://phabricator.wikimedia.org/T253959) (owner: 10Palak199) [16:27:58] (03CR) 10JHathaway: [C: 03+2] mirrors: Raise ssl ciphersuite strength [puppet] - 10https://gerrit.wikimedia.org/r/769780 (owner: 10JHathaway) [16:28:11] (03CR) 10JHathaway: [C: 03+2] mirrors: use @resolve for syncproxy2.wna.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/769782 (owner: 10JHathaway) [16:33:33] (03PS4) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) [16:33:51] (03CR) 10Phuedx: Request high-entropy Sec-CH-UA* client hints (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [16:35:17] (03PS1) 10Ssingh: certspotter: add -start_at_end to only fetch new logs [puppet] - 10https://gerrit.wikimedia.org/r/770000 (https://phabricator.wikimedia.org/T303593) [16:35:59] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34223/console" [puppet] - 
10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus) [16:37:02] (03CR) 10Nskaggs: [C: 03+1] "Per my question on IRC, this will be hosted on https://tools-static.wmflabs.org/admin/fingerprints/. I believe hosting the fingerprints pu" [puppet] - 10https://gerrit.wikimedia.org/r/766292 (owner: 10Majavah) [16:38:30] (03PS2) 10Ssingh: certspotter: add -start_at_end to only fetch new logs [puppet] - 10https://gerrit.wikimedia.org/r/770000 (https://phabricator.wikimedia.org/T303593) [16:40:29] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34225/console" [puppet] - 10https://gerrit.wikimedia.org/r/770000 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [16:45:17] nemo-yiannis: thanks for the ping! looks like you got everything resolved. [16:47:28] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10thcipriani) >>! In T303450#7769389, @Joe wrote: > @thcipriani I guess we need your approval for this. Approved! Needed now that im... 
[16:49:12] (03PS1) 10Jbond: reposync: dont catch RepoSyncNoChangeError [software/spicerack] - 10https://gerrit.wikimedia.org/r/770003 [16:52:34] (03PS18) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [16:53:46] (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [16:54:36] (03PS8) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) [16:55:22] (03CR) 10jerkins-bot: [V: 04-1] sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [16:57:11] (03PS5) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) [16:58:03] (03PS1) 10Btullis: Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) [16:58:28] (03CR) 10Phuedx: Request high-entropy Sec-CH-UA* client hints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [16:59:46] (03PS19) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [16:59:48] (03CR) 10Phuedx: Request high-entropy Sec-CH-UA* client hints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [17:02:38] (03CR) 10jerkins-bot: [V: 04-1] sre.puppet.sync-netbox-hiera: 
Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [17:07:44] (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: add -start_at_end to only fetch new logs [puppet] - 10https://gerrit.wikimedia.org/r/770000 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [17:10:16] (03PS2) 10Btullis: Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) [17:15:08] (03PS3) 10Btullis: Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) [17:16:19] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34229/console" [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis) [17:29:33] (03PS1) 10Ssingh: certspotter: re-enable systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) [17:30:52] 10SRE, 10VPS-project-Codesearch, 10Patch-For-Review: Add operations/software/purged to Codesearch - https://phabricator.wikimedia.org/T303434 (10Ladsgroup) I leave it open for the rest of operations/software. [17:31:59] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34230/console" [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [17:32:42] (03CR) 10Ssingh: [V: 03+1 C: 04-1] "Do not merge before Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [17:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:48:23] (03PS4) 10DCausse: [wdqs] switch wdqs1010 to the streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108) [17:56:08] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse) [18:01:45] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Refactor envoy access_log_path to access loggers - https://phabricator.wikimedia.org/T303231 (10RLazarus) [18:15:31] (03CR) 10Krinkle: [C: 03+1] "LGTM. Good to deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [18:18:28] (03CR) 10Krinkle: [C: 03+1] "Just in case: This is not verifiable via mwdebug so deployers should take care to look for errors in the main mediawiki-errors dashboard i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [18:50:17] (03PS1) 10Jcrespo: Improve logic and quality of life for remote backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/770021 (https://phabricator.wikimedia.org/T138562) [18:55:29] (03PS2) 10Jcrespo: Improve logic and quality of life for remote backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/770021 (https://phabricator.wikimedia.org/T138562) [19:29:34] (03PS3) 10Jcrespo: Improve logic and quality of life for remote backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/770021 (https://phabricator.wikimedia.org/T138562) [19:35:41] (03PS1) 10Jcrespo: dbbackups: Migrate 
remote backup (snapshot) cmd line to 0.7 format [puppet] - 10https://gerrit.wikimedia.org/r/770023 (https://phabricator.wikimedia.org/T138562) [19:36:37] (03CR) 10Jcrespo: [C: 04-1] "Wait for 0.7 wmfbackup-remote package deployment." [puppet] - 10https://gerrit.wikimedia.org/r/770023 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [21:10:49] (03CR) 10Herron: [C: 03+1] "Thanks for patching this! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis) [21:25:45] 10SRE, 10ConfirmEdit (CAPTCHA extension), 10Platform Engineering, 10Wikimedia-Site-requests, and 3 others: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits - https://phabricator.wikimedia.org/T303433 (10sbassett) Given > ... it would be ideal if Stewards could enable this themselves, w... [21:28:49] (03CR) 10Herron: "This change is ready for review." (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/768108 (https://phabricator.wikimedia.org/T302842) (owner: 10Herron) [21:30:59] 10SRE, 10ConfirmEdit (CAPTCHA extension), 10Platform Engineering, 10Wikimedia-Site-requests, and 3 others: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits - https://phabricator.wikimedia.org/T303433 (10DannyS712) >>! In T303433#7770934, @sbassett wrote: > Given > >> ... it would be id... [21:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:21:13] (03CR) 10Cwhite: [C: 03+1] "Thanks for this! 
:rocket:" [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis) [22:33:56] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:40:14] (03CR) 10Andrew Bogott: [C: 03+1] P:toolforge::static: publish SSH fingerprints under /admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766292 (owner: 10Majavah) [22:45:27] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) >>! In T300130#7769545, @elukey wrote: > @colewhite I had a chat with John and the only current supported way is to have a self-hosted puppet master in the clou... [23:29:22] (03CR) 10Volans: Fix the prometheus elasticsearch exporter on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis) [23:35:34] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook