[00:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:39:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910542 [00:39:16] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910542 (owner: 10TrainBranchBot) [00:55:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910542 (owner: 10TrainBranchBot) [01:20:11] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:35] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:27:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:28:13] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:28:45] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:30:23] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:49:17] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:32] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [02:09:32] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [02:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:16:09] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:20:23] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit 
httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [02:30:31] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [02:47:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [02:48:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [03:00:27] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:47] I'm having trouble loading https://lists.wikimedia.org/ [03:28:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:29:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:31:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.793 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:32:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:47:37] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:16] (NodeTextfileStale) firing: Stale 
textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:17:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [05:18:07] (ProbeDown) firing: (4) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:18:44] Response time just jumped out the blue, but looks like already recovered [05:20:56] oh no, still some lag on some wikis [05:21:10] Original error: upstream connect error or disconnect/reset before headers. 
reset reason: overflow [05:21:15] getting full errors now [05:21:19] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:55] Got the page but unable to be in front of a PC right now [05:22:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee [05:23:03] (ProbeDown) firing: (10) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:23:07] (ProbeDown) resolved: (4) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:23:43] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [05:23:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [05:28:03] (ProbeDown) resolved: (11) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:44] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [05:28:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [05:49:17] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:12:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [06:14:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [06:14:18] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:32:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [06:35:18] 10ops-eqsin: ManagementSSHDown - 
https://phabricator.wikimedia.org/T334785 (10phaultfinder) [06:52:18] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [06:53:19] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [06:54:54] (03PS4) 10Anzx: Disable wmgNewUserMessageOnAutoCreate from Extension:NewUserMessage on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910018 (https://phabricator.wikimedia.org/T335090) [07:00:05] Amir1, Urbanecm, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T0700). [07:00:05] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:06:16] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab1004.wikimedia.org with OS bullseye [07:12:32] (JobUnavailable) firing: (3) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:17:32] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:21:37] !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage [07:22:34] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:24:21] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage [07:35:47] PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:36:16] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.4a in codfw [07:38:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.4a in codfw [07:39:36] (03PS1) 10Muehlenhoff: Remove access for hshaath,hghani,ilooremeta [puppet] - 10https://gerrit.wikimedia.org/r/911247 [07:39:43] !log restarting blazegraph on wdqs1005 (stuck for 3+days) [07:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:05] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.258 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:42:14] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.59 in codfw [07:42:43] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:43:28] o/ can someone reboot wdqs1015 from the console (seems completely stuck, can't ssh)? 
[07:44:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.59 in codfw [07:45:29] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1004.wikimedia.org with OS bullseye [07:48:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:49:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for hshaath,hghani,ilooremeta [puppet] - 10https://gerrit.wikimedia.org/r/911247 (owner: 10Muehlenhoff) [07:49:55] wdqs1005 alert is expected (the machine has been depooled while it's catching up on lag) [07:51:23] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.41 in codfw [07:53:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.41 in codfw [08:03:07] (03PS1) 10Slyngshede: Requisition approval functionality. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/911249 [08:05:32] !log Enable replication eqiad -> codfw on pc1 dbmaint eqiad T335266 [08:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:38] T335266: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T335266 [08:05:38] jouncebot: nowandnext [08:05:38] No deployments scheduled for the next 1 hour(s) and 54 minute(s) [08:05:38] In 1 hour(s) and 54 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T1000) [08:06:45] !log Enable replication eqiad -> codfw on pc2 dbmaint eqiad T335266 [08:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:22] !log Enable replication eqiad -> codfw on pc3 dbmaint eqiad T335266 [08:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:15] (03CR) 10Muehlenhoff: pybal/lvs: remove backward compatibility for buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [08:08:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 6 hosts with reason: Enabling replication T335266 [08:08:42] !log Enable replication eqiad -> codfw on es4 dbmaint eqiad T335266 [08:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 6 hosts with reason: Enabling replication T335266 [08:09:13] (03CR) 10Klausman: [C: 03+1] cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 (owner: 10Clément Goubert) [08:09:33] !log Deploying 909302 on deploy1002 for T329857 [08:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:38] T329857: MediaWiki deploy servers should not be mediawiki 
installation targets - https://phabricator.wikimedia.org/T329857 [08:10:29] !log Disabling puppet on deploy2002 - T329857 [08:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:36] (03CR) 10Clément Goubert: [C: 03+2] Enable /srv/mediawiki symlink on prod deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/909302 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [08:12:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:14:02] (03PS1) 10Muehlenhoff: Extend access for appledora [puppet] - 10https://gerrit.wikimedia.org/r/911250 [08:14:23] !log Deploying 909302 on deploy2002 for T329857 [08:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:05] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for appledora [puppet] - 10https://gerrit.wikimedia.org/r/911250 (owner: 10Muehlenhoff) [08:17:10] !log Enable replication eqiad -> codfw on es5 dbmaint eqiad T335266 [08:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:15] T335266: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T335266 [08:18:16] !log cgoubert@deploy2002 Started scap: testing T329857 [08:18:21] T329857: MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 [08:20:09] !log Enable replication eqiad -> codfw on x1 dbmaint eqiad T335266 [08:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 10 hosts with reason: Enabling replication T335266 [08:20:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 0:15:00 on 10 hosts with reason: Enabling replication T335266 [08:21:42] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-worker1110.eqiad.wmnet with reason: Upgrading RAID controller firmware [08:21:53] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:21:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-worker1110.eqiad.wmnet with reason: Upgrading RAID controller firmware [08:22:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:23:33] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:25:22] !log btullis@cumin1001 START - Cookbook sre.hosts.dhcp for host an-worker1110.eqiad.wmnet [08:26:50] !log Enable replication eqiad -> codfw on s2 dbmaint eqiad T335266 [08:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:57] T335266: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T335266 [08:26:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 27 hosts with reason: Enabling replication T335266 [08:27:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 27 hosts with reason: Enabling replication T335266 [08:28:14] !log Enable replication eqiad -> codfw on s6 dbmaint eqiad T335266 [08:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 27 hosts with reason: Enabling replication T335266 [08:29:00] !log marostegui@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 27 hosts with reason: Enabling replication T335266 [08:31:43] (03CR) 10Marostegui: "Just to confirm, this is not the password we are using right?" [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [08:32:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 26 hosts with reason: Enabling replication T335266 [08:32:05] T335266: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T335266 [08:32:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 26 hosts with reason: Enabling replication T335266 [08:32:45] !log cgoubert@deploy2002 Finished scap: testing T329857 (duration: 14m 29s) [08:32:51] T329857: MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 [08:33:01] !log Enable replication eqiad -> codfw on s5 dbmaint eqiad T335266 [08:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:43:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 34 hosts with reason: Enabling replication T335266 [08:43:59] T335266: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T335266 [08:44:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 34 hosts with reason: Enabling replication T335266 [08:44:48] !log Enable replication eqiad -> codfw on s8 dbmaint eqiad T335266 [08:44:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:35] !log uploaded php-excimer 1.0.2-1+wmf3+buster1 (which rebases Excimer to 1.1.1) to component/php74 for buster-wikimedia T332964 [08:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:41] T332964: Upgrade php-excimer package from 1.0.4 to 1.1.1 - https://phabricator.wikimedia.org/T332964 [08:48:14] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [08:49:37] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:51:13] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:53:33] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40797/console" [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [08:56:03] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:56:42] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [08:57:39] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:03:37] PROBLEM - mediawiki-installation DSH group on deploy1002 is CRITICAL: Host deploy1002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:07:27] 10SRE-Access-Requests: Update production ssh public key for jcrespo - https://phabricator.wikimedia.org/T335269 (10jcrespo) [09:07:41] 10SRE-Access-Requests: Update production ssh 
public key for jcrespo - https://phabricator.wikimedia.org/T335269 (jcrespo) p:Triage→Medium [09:08:38] (PS4) Aqu: analytics: Add purge job for webrequest data loss reports [puppet] - https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) [09:10:55] (CR) Aqu: "Thanks for the review!" [puppet] - https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: Aqu) [09:11:43] PROBLEM - mediawiki-installation DSH group on deploy2002 is CRITICAL: Host deploy2002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:11:54] ^expected [09:12:20] (PS1) Jcrespo: admin: Update production ssh public key for jcrespo [puppet] - https://gerrit.wikimedia.org/r/911254 (https://phabricator.wikimedia.org/T335269) [09:14:21] (CR) Jcrespo: "Given it is an access-related patch I will wait for clinic duty and foundations feedback." [puppet] - https://gerrit.wikimedia.org/r/911254 (https://phabricator.wikimedia.org/T335269) (owner: Jcrespo) [09:15:19] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:16:55] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:20:27] (PS10) Slyngshede: SSH Keymanagement, allow user to manage ssh keys.
[software/bitu] - https://gerrit.wikimedia.org/r/899519 [09:21:18] !log upgrade php-excimer on mw canaries to 1.0.2-1+wmf3+buster1 (which rebases Excimer to 1.1.1) T332964 [09:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:24] T332964: Upgrade php-excimer package from 1.0.4 to 1.1.1 - https://phabricator.wikimedia.org/T332964 [09:21:46] (PS2) Vgutierrez: varnish: Allow disabling port 80 [puppet] - https://gerrit.wikimedia.org/r/907824 (https://phabricator.wikimedia.org/T322774) [09:23:27] SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (MatthewVernon) [09:23:40] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40798/console" [puppet] - https://gerrit.wikimedia.org/r/907824 (https://phabricator.wikimedia.org/T322774) (owner: Vgutierrez) [09:23:44] SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (MatthewVernon) [09:25:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host an-worker1110.eqiad.wmnet [09:26:39] (CR) Slyngshede: SSH Keymanagement, allow user to manage ssh keys.
(3 comments) [software/bitu] - https://gerrit.wikimedia.org/r/899519 (owner: Slyngshede) [09:26:43] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:26:59] (PS1) Clément Goubert: P:mediawiki::common: Remove deploy check_dsh_group [puppet] - https://gerrit.wikimedia.org/r/911259 (https://phabricator.wikimedia.org/T329857) [09:29:22] (PS2) Clément Goubert: P:mediawiki::common: Remove deploy check_dsh_group [puppet] - https://gerrit.wikimedia.org/r/911259 (https://phabricator.wikimedia.org/T329857) [09:30:30] (PS3) Clément Goubert: P:mediawiki::common: Remove deploy check_dsh_group [puppet] - https://gerrit.wikimedia.org/r/911259 (https://phabricator.wikimedia.org/T329857) [09:30:39] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/911259 (https://phabricator.wikimedia.org/T329857) (owner: Clément Goubert) [09:31:41] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:33:05] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/910460 (https://phabricator.wikimedia.org/T330151) (owner: Btullis) [09:34:47] (PS5) Aqu: analytics: Add purge job for webrequest data loss reports [puppet] - https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) [09:36:39] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:41:33] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:43:05] (CR) Btullis: [C: +2] Add a custom ceph_disks fact [puppet] - https://gerrit.wikimedia.org/r/910460 (https://phabricator.wikimedia.org/T330151) (owner: Btullis) [09:44:40] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/911254
(https://phabricator.wikimedia.org/T335269) (owner: 10Jcrespo) [09:46:02] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/failover] Add check for DNS records update [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [09:46:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 (owner: 10Slyngshede) [09:49:00] (03Merged) 10jenkins-bot: [gitlab/failover] Add check for DNS records update [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [09:49:17] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:51:33] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:54:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 28 hosts with reason: Enabling replication T335266 [09:54:34] T335266: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T335266 [09:55:07] !log Enable replication eqiad -> codfw on s7 dbmaint eqiad T335266 [09:55:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 28 hosts with reason: Enabling replication T335266 [09:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:51] !log Update LDAP schema wmf-user: T148048 [09:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:56] T148048: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 [09:57:45] (03CR) 10Slyngshede: [C: 03+2] P:openldap Extend wmf-user schema with global account. 
[puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [09:59:21] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/910758 (https://phabricator.wikimedia.org/T335181) (owner: 10Majavah) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T1000) [10:01:23] !log installing git security updates [10:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:13] (03CR) 10Marostegui: [C: 03+1] admin: Update production ssh public key for jcrespo [puppet] - 10https://gerrit.wikimedia.org/r/911254 (https://phabricator.wikimedia.org/T335269) (owner: 10Jcrespo) [10:05:51] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Update production ssh public key for jcrespo - https://phabricator.wikimedia.org/T335269 (10Marostegui) @jcrespo I assume you self serve? [10:06:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 24 hosts with reason: Enabling replication T335266 [10:06:31] T335266: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T335266 [10:06:38] !log Enable replication eqiad -> codfw on s3 dbmaint eqiad T335266 [10:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 24 hosts with reason: Enabling replication T335266 [10:07:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 35 hosts with reason: Enabling replication T335266 [10:07:58] !log Enable replication eqiad -> codfw on s4 dbmaint eqiad T335266 [10:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:11] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0)
https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:08:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 35 hosts with reason: Enabling replication T335266 [10:09:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 38 hosts with reason: Enabling replication T335266 [10:10:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 38 hosts with reason: Enabling replication T335266 [10:11:38] !log Enable replication eqiad -> codfw on s1 dbmaint eqiad T335266 [10:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:43] T335266: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T335266 [10:12:33] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [10:14:18] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:14:32] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [10:14:36] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Hibashaath out of all services on: 1262 hosts [10:14:51] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:16:31] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:16:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Hibashaath out of all services on: 1262 hosts [10:17:42] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Hibashaath out of all services on: 801 hosts [10:17:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) 
Logging Hibashaath out of all services on: 801 hosts [10:18:10] (03PS3) 10Vgutierrez: varnish: Allow disabling port 80 [puppet] - 10https://gerrit.wikimedia.org/r/907824 (https://phabricator.wikimedia.org/T322774) [10:18:13] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Hghani out of all services on: 801 hosts [10:18:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Hghani out of all services on: 801 hosts [10:19:02] (03PS1) 10Marostegui: install_server: Do not reimage db1212 [puppet] - 10https://gerrit.wikimedia.org/r/911263 [10:19:33] (03CR) 10Marostegui: [C: 03+1] mariadb: Add lists1003 grants for mailman dbs [puppet] - 10https://gerrit.wikimedia.org/r/910598 (https://phabricator.wikimedia.org/T331706) (owner: 10Ladsgroup) [10:19:46] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1212 [puppet] - 10https://gerrit.wikimedia.org/r/911263 (owner: 10Marostegui) [10:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:20:29] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Hghani out of all services on: 1262 hosts [10:20:34] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40799/console" [puppet] - 10https://gerrit.wikimedia.org/r/907824 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [10:22:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Hghani out of all services on: 1262 hosts [10:22:43] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Ilooremeta out of all services on: 1262 hosts [10:24:46] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.idm.logout (exit_code=0) Logging Ilooremeta out of all services on: 1262 hosts [10:26:31] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Ilooremeta out of all services on: 801 hosts [10:26:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Ilooremeta out of all services on: 801 hosts [10:27:33] (03PS4) 10Vgutierrez: varnish: Allow disabling port 80 [puppet] - 10https://gerrit.wikimedia.org/r/907824 (https://phabricator.wikimedia.org/T322774) [10:27:40] !log Datacenter switchover live testing setting db to read-only and back in eqiad - T327920 [10:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:45] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [10:27:53] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:28:58] 10SRE, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth, 10Mobile: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out stat... 
- https://phabricator.wikimedia.org/T335125 [10:29:07] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [10:29:38] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [10:29:40] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [10:29:43] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [10:29:58] !log Datacenter switchover live testing setting db to read-only and back in eqiad successful - T327920 [10:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:17] 10SRE, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth, 10Mobile: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out stat... - https://phabricator.wikimedia.org/T335125 [10:32:32] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [10:35:33] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [10:36:49] (03CR) 10Jbond: [C: 03+2] puppet::agent: rename the enable_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [10:36:59] (03CR) 10Jbond: [C: 03+2] environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 (owner: 10Jbond) [10:37:07] 10SRE-swift-storage: Document the process for making new-style storage nodes - https://phabricator.wikimedia.org/T335274 (10MatthewVernon) [10:38:02] (03CR) 10Jbond: [C: 03+2] core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [10:38:14] (03CR) 10Jbond: [C: 03+2] wmflib: updat ipresolv to work with puppet7 [puppet] - 
10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [10:38:14] 10SRE-swift-storage, 10Thumbor, 10SVG: SVG rasterizer renders non Latin text as tofu glyph randomly - https://phabricator.wikimedia.org/T335271 (10Snaevar) [10:45:39] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:45:52] (03CR) 10FNegri: [C: 03+2] ToolsDB: remove replication filters [puppet] - 10https://gerrit.wikimedia.org/r/909695 (https://phabricator.wikimedia.org/T328691) (owner: 10FNegri) [10:47:43] 10SRE-swift-storage: Create new storage scheme entries for larger disks_by_path swift backends - https://phabricator.wikimedia.org/T335275 (10MatthewVernon) [10:48:05] (03CR) 10Jbond: [C: 03+1] sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/910438 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [10:49:03] (03CR) 10Jcrespo: [C: 03+2] admin: Update production ssh public key for jcrespo [puppet] - 10https://gerrit.wikimedia.org/r/911254 (https://phabricator.wikimedia.org/T335269) (owner: 10Jcrespo) [10:49:51] 10ops-codfw, 10DBA: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Marostegui) Yes, let's wait for the DC switchover. So we can sync on this next week. 
[10:50:31] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:52:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [10:53:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [10:54:13] 10SRE-swift-storage: Bring ms-be207[0-3] into the rings - https://phabricator.wikimedia.org/T335278 (10MatthewVernon) [10:54:45] 10SRE-swift-storage: Bring ms-be207[0-3] into the rings - https://phabricator.wikimedia.org/T335278 (10MatthewVernon) [10:54:47] 10SRE-swift-storage: Create new storage scheme entries for larger disks_by_path swift backends - https://phabricator.wikimedia.org/T335275 (10MatthewVernon) [10:56:03] !log deployed new ssh key for jcrespo on production cluster [10:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:19] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/910041 [10:56:21] (03CR) 10Muehlenhoff: [C: 03+2] sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/910438 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [10:56:30] 10SRE-swift-storage: Bring ms-be107[2-5] into the rings - https://phabricator.wikimedia.org/T335279 (10MatthewVernon) [10:56:47] 10SRE-swift-storage: Create new storage scheme entries for larger disks_by_path swift backends - https://phabricator.wikimedia.org/T335275 (10MatthewVernon) [10:56:49] 10SRE-swift-storage: Bring ms-be107[2-5] into the rings - https://phabricator.wikimedia.org/T335279 (10MatthewVernon) [10:58:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Update production ssh public key for jcrespo - https://phabricator.wikimedia.org/T335269 (10jcrespo) 05Open→03Resolved >>! In T335269#8800709, @Marostegui wrote: > @jcrespo I assume you self serve? Yep. Deployed. 
Going to a meeting while the new key (al... [10:58:10] 10SRE-swift-storage: Drain and then decommission ms-be20[40-43] - https://phabricator.wikimedia.org/T335280 (10MatthewVernon) [10:58:59] 10SRE-swift-storage: Drain and then decommission ms-be10[40-43] - https://phabricator.wikimedia.org/T335281 (10MatthewVernon) [11:01:12] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/910041 (owner: 10Jbond) [11:03:30] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/910041 (owner: 10Jbond) [11:05:27] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:10:25] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:11:09] (03CR) 10FNegri: [C: 03+1] toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909397 (https://phabricator.wikimedia.org/T334925) (owner: 10BryanDavis) [11:12:03] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:13:24] !log Fixing appserver clusters canary weights [11:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:45] 10SRE, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth, 10Mobile: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out stat... 
- https://phabricator.wikimedia.org/T335125 [11:13:53] (03PS3) 10Jbond: git-sync-upstream: add support for gituser and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 [11:13:59] !log cgoubert@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=parsoid,service=canary [11:14:14] (03CR) 10CI reject: [V: 04-1] git-sync-upstream: add support for gituser and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 (owner: 10Jbond) [11:14:31] !log cgoubert@cumin1001 conftool action : set/weight=10; selector: dc=codfw,cluster=parsoid,service=canary [11:14:34] 10SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10Marostegui) p:05Triage→03Medium a:03HShaikh Assigning to @HShaikh as we are waiting for an answer for the explanation provided at T334752#8787305 [11:15:48] (03PS4) 10Jbond: git-sync-upstream: add support for gituser and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 [11:15:56] (03PS66) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [11:16:36] (03CR) 10CI reject: [V: 04-1] git-sync-upstream: add support for gituser and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 (owner: 10Jbond) [11:16:51] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:17:38] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [11:18:52] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:19:32] !log cgoubert@cumin1001 conftool action : set/weight=30; selector: dc=eqiad,cluster=appserver,service=canary [11:19:54] (03PS5) 10Jbond: git-sync-upstream: add support for gituser and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 [11:20:15] (03CR) 10CI 
reject: [V: 04-1] git-sync-upstream: add support for gituser and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 (owner: 10Jbond) [11:21:18] !log cgoubert@cumin1001 conftool action : set/weight=25; selector: dc=codfw,cluster=appserver,service=canary [11:21:33] (03PS1) 10Marostegui: data.yaml: Add Ccoxwell [puppet] - 10https://gerrit.wikimedia.org/r/911277 (https://phabricator.wikimedia.org/T335150) [11:22:07] (03CR) 10Jaime Nuche: [C: 03+1] P:mediawiki::common: Remove deploy check_dsh_group [puppet] - 10https://gerrit.wikimedia.org/r/911259 (https://phabricator.wikimedia.org/T329857) (owner: 10Clément Goubert) [11:22:26] (03CR) 10CI reject: [V: 04-1] data.yaml: Add Ccoxwell [puppet] - 10https://gerrit.wikimedia.org/r/911277 (https://phabricator.wikimedia.org/T335150) (owner: 10Marostegui) [11:23:16] !log cgoubert@cumin1001 conftool action : set/weight=30; selector: dc=codfw,cluster=api_appserver,service=canary [11:23:16] (03PS2) 10Marostegui: data.yaml: Add Ccoxwell [puppet] - 10https://gerrit.wikimedia.org/r/911277 (https://phabricator.wikimedia.org/T335150) [11:24:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/911277 (https://phabricator.wikimedia.org/T335150) (owner: 10Marostegui) [11:24:28] (03PS3) 10Marostegui: data.yaml: Add Ccoxwell [puppet] - 10https://gerrit.wikimedia.org/r/911277 (https://phabricator.wikimedia.org/T335150) [11:24:49] (03CR) 10Marostegui: data.yaml: Add Ccoxwell (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911277 (https://phabricator.wikimedia.org/T335150) (owner: 10Marostegui) [11:25:08] (03CR) 10Clément Goubert: [C: 03+2] P:mediawiki::common: Remove deploy check_dsh_group [puppet] - 10https://gerrit.wikimedia.org/r/911259 (https://phabricator.wikimedia.org/T329857) (owner: 10Clément Goubert) [11:25:44] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add Ccoxwell [puppet] - 10https://gerrit.wikimedia.org/r/911277 (https://phabricator.wikimedia.org/T335150) 
(owner: 10Marostegui) [11:26:57] 10SRE, 10Infrastructure-Foundations, 10serviceops: Deal with archival of Stretch on Debian mirrors - https://phabricator.wikimedia.org/T335282 (10MoritzMuehlenhoff) [11:27:30] (03PS1) 10Jbond: builder: drop stretch [puppet] - 10https://gerrit.wikimedia.org/r/911278 [11:27:36] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Ccoxwell - https://phabricator.wikimedia.org/T335150 (10Marostegui) 05Open→03Resolved a:03Marostegui User verified. Added to WMF LDAP group, wmf-nda Phabricator group and data.yaml. Please allow 30 minutes for puppet to run... [11:28:04] (03CR) 10CI reject: [V: 04-1] builder: drop stretch [puppet] - 10https://gerrit.wikimedia.org/r/911278 (owner: 10Jbond) [11:28:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40800/console" [puppet] - 10https://gerrit.wikimedia.org/r/911278 (owner: 10Jbond) [11:29:08] (03CR) 10Muehlenhoff: "Can you link this to T335282 (which I have just created)?" 
[puppet] - 10https://gerrit.wikimedia.org/r/911278 (owner: 10Jbond) [11:29:16] (03PS2) 10Jbond: builder: drop stretch [puppet] - 10https://gerrit.wikimedia.org/r/911278 [11:29:50] (03CR) 10CI reject: [V: 04-1] builder: drop stretch [puppet] - 10https://gerrit.wikimedia.org/r/911278 (owner: 10Jbond) [11:31:31] (03PS3) 10Jbond: builder: drop stretch [puppet] - 10https://gerrit.wikimedia.org/r/911278 (https://phabricator.wikimedia.org/T335282) [11:32:05] (03CR) 10CI reject: [V: 04-1] builder: drop stretch [puppet] - 10https://gerrit.wikimedia.org/r/911278 (https://phabricator.wikimedia.org/T335282) (owner: 10Jbond) [11:34:45] (03PS1) 10Jbond: docker-reporter: Exclude stretch images from reports [puppet] - 10https://gerrit.wikimedia.org/r/911279 (https://phabricator.wikimedia.org/T335282) [11:35:14] 10SRE, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth, 10Mobile: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out stat... 
- https://phabricator.wikimedia.org/T335125 [11:36:10] (03PS1) 10Muehlenhoff: No longer use mirrors.debian.org on Stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) [11:38:17] (03CR) 10CI reject: [V: 04-1] No longer use mirrors.debian.org on Stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [11:38:52] (03PS4) 10Jbond: builder: drop stretch [puppet] - 10https://gerrit.wikimedia.org/r/911278 (https://phabricator.wikimedia.org/T335282) [11:39:12] (03CR) 10Jbond: builder: drop stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911278 (https://phabricator.wikimedia.org/T335282) (owner: 10Jbond) [11:39:22] (03PS2) 10Jbond: docker-reporter: Exclude stretch images from reports [puppet] - 10https://gerrit.wikimedia.org/r/911279 (https://phabricator.wikimedia.org/T335282) [11:40:25] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review: Deal with archival of Stretch on Debian mirrors - https://phabricator.wikimedia.org/T335282 (10JMeybohm) [11:41:32] (03PS2) 10Muehlenhoff: No longer use mirrors.debian.org on Stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) [11:43:45] (03CR) 10Jgiannelos: [C: 03+1] push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061) (owner: 10Clément Goubert) [11:44:19] (03PS1) 10Muehlenhoff: Stop building stretch baseimage [puppet] - 10https://gerrit.wikimedia.org/r/911281 (https://phabricator.wikimedia.org/T335282) [11:44:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [11:46:50] (03PS1) 10Jbond: ceph_disks: ensure all confines are considered [puppet] - 10https://gerrit.wikimedia.org/r/911282 
(https://phabricator.wikimedia.org/T330151) [11:47:50] (03CR) 10Jbond: [V: 03+2 C: 03+2] ceph_disks: ensure all confines are considered [puppet] - 10https://gerrit.wikimedia.org/r/911282 (https://phabricator.wikimedia.org/T330151) (owner: 10Jbond) [11:49:13] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:50:33] (03PS3) 10Muehlenhoff: No longer use mirrors.debian.org on Stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) [11:54:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [11:57:12] (03PS1) 10Jbond: ceph_disks: Skip if we dont have drive information: [puppet] - 10https://gerrit.wikimedia.org/r/911284 (https://phabricator.wikimedia.org/T330151) [12:01:12] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment enable ldap property editor. 
[puppet] - 10https://gerrit.wikimedia.org/r/909658 (owner: 10Slyngshede) [12:01:15] (03CR) 10Jbond: [C: 03+2] ceph_disks: Skip if we dont have drive information: [puppet] - 10https://gerrit.wikimedia.org/r/911284 (https://phabricator.wikimedia.org/T330151) (owner: 10Jbond) [12:01:44] slyngs: happy for me to merge yours [12:01:52] Yes [12:02:11] (03PS4) 10Muehlenhoff: No longer use mirrors.debian.org on Stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) [12:02:14] done [12:02:17] Thanks [12:03:21] (03PS1) 10EoghanGaffney: Fix warning message for DNS discrepancies [cookbooks] - 10https://gerrit.wikimedia.org/r/911285 [12:04:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [12:07:55] (03PS1) 10Slyngshede: C:idm:deployment fix missing import [puppet] - 10https://gerrit.wikimedia.org/r/911286 [12:08:41] (03CR) 10Slyngshede: [C: 03+2] C:idm:deployment fix missing import [puppet] - 10https://gerrit.wikimedia.org/r/911286 (owner: 10Slyngshede) [12:09:51] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10RobH) a:05RobH→03wiki_willy Ok I went away for the weekend and came back with 100s of notifications from the SSH down tasks. These seem to be false positives and fire too often, who can we chat with to raise the thresholds on the...
[12:12:27] (03CR) 10Jelto: [C: 03+1] "lgtm, one question in line" [cookbooks] - 10https://gerrit.wikimedia.org/r/911285 (owner: 10EoghanGaffney) [12:14:34] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:14:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/911279 (https://phabricator.wikimedia.org/T335282) (owner: 10Jbond) [12:14:47] (03PS2) 10EoghanGaffney: Fix warning message for DNS discrepancies [cookbooks] - 10https://gerrit.wikimedia.org/r/911285 [12:16:31] (03CR) 10Clément Goubert: [C: 03+2] push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061) (owner: 10Clément Goubert) [12:17:20] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:17:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:19:22] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:20:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/911278 (https://phabricator.wikimedia.org/T335282) (owner: 10Jbond) [12:20:43] (03CR) 10EoghanGaffney: Fix warning message for DNS discrepancies (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/911285 (owner: 10EoghanGaffney) [12:23:32] (03Merged) 10jenkins-bot: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061) (owner: 10Clément Goubert) [12:24:12] claime: I am not sure if 
mw api will actually raise an error for a failing token or just return 20x with an error :/ [12:24:23] but as long as we can GET the api it should be good enough [12:24:46] nemo-yiannis: I will be checking the kubectl get log which are pretty verbose [12:24:50] (03CR) 10JMeybohm: [C: 03+1] Stop building stretch baseimage [puppet] - 10https://gerrit.wikimedia.org/r/911281 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [12:24:58] If I'm quick enough I should catch only my request :p [12:26:11] (03PS1) 10Jbond: ceph_disks: add more info based on pd list [puppet] - 10https://gerrit.wikimedia.org/r/911287 (https://phabricator.wikimedia.org/T330151) [12:26:26] nemo-yiannis: There's something bothering me, I need to check before merging [12:26:35] (03CR) 10Jelto: Fix warning message for DNS discrepancies (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/911285 (owner: 10EoghanGaffney) [12:28:18] nemo-yiannis: ok it's all right, I checked too quick, the update hadn't been pooled [12:28:31] (03PS2) 10Jbond: ceph_disks: add more info based on pd list [puppet] - 10https://gerrit.wikimedia.org/r/911287 (https://phabricator.wikimedia.org/T330151) [12:28:44] !log Deploying push-notifications staging for switch to mw-api-int - T334061 [12:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:50] T334061: Migrate push-notifications to mw-api-int - https://phabricator.wikimedia.org/T334061 [12:28:53] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [12:29:05] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [12:29:28] (03PS7) 10JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) [12:30:18] nemo-yiannis: I could see my request in queue, and then a 200 api_error [12:30:19] (03PS6) 10Jbond: git-sync-upstream: add support 
for gituser and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 [12:30:29] (03PS67) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [12:30:57] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10RobH) 05Open→03Resolved a:05RobH→03None Ok, I think these alerts are false positives, and they are updated so often from prometheus to make them not useful. On these, it shows the following at this time: > description**: The... [12:31:26] nemo-yiannis: And I can see the requests in mw-api-int logs [12:32:01] nemo-yiannis: https://logstash.wikimedia.org/goto/f447d7bf1f4e6db65712f6c28fe15062 [12:32:02] Is that the error: "Invalid CSRF token." ? [12:32:10] ah thanks! [12:33:18] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:33:21] (03CR) 10Krinkle: Define dummy pass for passwords::excimer_ui_server (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [12:34:55] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40802/console" [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [12:34:57] (03CR) 10Jbond: [C: 03+2] ceph_disks: add more info based on pd list [puppet] - 10https://gerrit.wikimedia.org/r/911287 (https://phabricator.wikimedia.org/T330151) (owner: 10Jbond) [12:35:09] (03CR) 10Marostegui: [C: 03+1] Define dummy pass for passwords::excimer_ui_server [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) 
[12:35:19] nemo-yiannis: I pasted the relevant logs to the task, it looks ok to me (if we accept that a 200 error is... weird) [12:35:32] yeah in terms of connectivity looks OK [12:36:10] Should I go ahead and change the production config? Or do you want to stay with staging this way a bit longer? [12:36:59] (03CR) 10Marostegui: [C: 03+1] "I can deploy this to our private production puppet, once you are ready for it." [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [12:37:26] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [12:38:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [12:38:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/911281 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [12:38:34] (03CR) 10Jbond: [C: 03+2] docker-reporter: Exclude stretch images from reports [puppet] - 10https://gerrit.wikimedia.org/r/911279 (https://phabricator.wikimedia.org/T335282) (owner: 10Jbond) [12:38:52] (03CR) 10Jbond: [C: 03+2] builder: drop stretch [puppet] - 10https://gerrit.wikimedia.org/r/911278 (https://phabricator.wikimedia.org/T335282) (owner: 10Jbond) [12:39:14] (03PS3) 10Clément Goubert: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/908955 (https://phabricator.wikimedia.org/T334064) [12:43:30] (03CR) 10CI reject: [V: 04-1] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/908955 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [12:45:00] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed:
kube-apiserver.service,kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:29] this is me [12:48:54] ack [12:50:54] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10RobH) 05Open→03Resolved So every single one of the items currently listed as ssh down in the constantly changing but inaccurate task description are: ` robh@cumin1001:~$ ping re0.cr2-esams.mgmt.esams.wmnet PING re0.cr2-esams.mgmt... [12:50:57] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10RobH) [12:51:16] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10RobH) a:05RobH→03None [12:51:24] (03CR) 10Clément Goubert: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/911288 (https://phabricator.wikimedia.org/T334061) (owner: 10Clément Goubert) [12:51:26] (03CR) 10Ottomata: [C: 03+1] "LGTM! You'll need that site.pp entry too when you are ready to apply this role to a specific node." [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [12:52:42] (03PS1) 10JMeybohm: Revert "k8s: Configure the IPv6 service ip range for apiserver" [puppet] - 10https://gerrit.wikimedia.org/r/910873 [12:52:49] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10RobH) ` robh@cumin1001:~$ ping asw2-ulsfo.mgmt.ulsfo.wmnet PING asw2-ulsfo.mgmt.ulsfo.wmnet (10.128.128.7) 56(84) bytes of data. 64 bytes from asw2-ulsfo.mgmt.ulsfo.wmnet (10.128.128.7): icmp_seq=1 ttl=60 time=71.5 ms 64 bytes from asw... 
[12:53:10] (03PS2) 10JMeybohm: Revert "k8s: Configure the IPv6 service ip range for apiserver" [puppet] - 10https://gerrit.wikimedia.org/r/910873 (https://phabricator.wikimedia.org/T307943) [12:53:10] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:11] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10RobH) 05Open→03Resolved p:05Triage→03Lowest a:05RobH→03None [12:53:44] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334783 (10RobH) 05Open→03Resolved a:05RobH→03None ` robh@cumin1001:~$ ping cr2-drmrs.mgmt.drmrs.wmnet PING cr2-drmrs.mgmt.drmrs.wmnet (10.136.128.7) 56(84) bytes of data. 64 bytes from cr2-drmrs.mgmt.drmrs.wmnet (10.136.128.7): icmp_seq=... [12:53:56] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334783 (10RobH) [12:54:09] (03CR) 10JMeybohm: [C: 03+2] Revert "k8s: Configure the IPv6 service ip range for apiserver" [puppet] - 10https://gerrit.wikimedia.org/r/910873 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [12:54:45] (03PS4) 10Clément Goubert: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/908955 (https://phabricator.wikimedia.org/T334064) [12:55:40] (03CR) 10Jgiannelos: [C: 03+1] push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/911288 (https://phabricator.wikimedia.org/T334061) (owner: 10Clément Goubert) [12:56:26] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [12:56:59] nemo-yiannis: Same test as earlier ? 
[12:57:13] (with the right url ofc) [12:57:44] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:59:36] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:08] RoanKattouw, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T1300). [13:00:09] aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] I can deploy today! [13:00:23] \o/ [13:00:39] 👋 [13:02:30] PROBLEM - SSH on wdqs1015 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:02:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910018 (https://phabricator.wikimedia.org/T335090) (owner: 10Anzx) [13:02:53] (03PS3) 10EoghanGaffney: Fix warning message for DNS discrepancies [cookbooks] - 10https://gerrit.wikimedia.org/r/911285 [13:03:34] (03Merged) 10jenkins-bot: Disable wmgNewUserMessageOnAutoCreate from Extension:NewUserMessage on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910018 (https://phabricator.wikimedia.org/T335090) (owner: 10Anzx) [13:03:50] (03CR) 10Muehlenhoff: [C: 03+2] Stop building stretch baseimage [puppet] - 10https://gerrit.wikimedia.org/r/911281 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [13:03:56] (03PS2) 10Muehlenhoff: Stop building stretch baseimage [puppet] - 10https://gerrit.wikimedia.org/r/911281 (https://phabricator.wikimedia.org/T335282) [13:04:09] !log urbanecm@deploy2002 
Started scap: Backport for [[gerrit:910018|Disable wmgNewUserMessageOnAutoCreate from Extension:NewUserMessage on knwikisource (T335090)]] [13:04:12] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:14] T335090: Disable wmgNewUserMessageOnAutoCreate from Extension:NewUserMessage on knwikisource - https://phabricator.wikimedia.org/T335090 [13:05:25] nemo-yiannis: I tested a push notification on my personal account before deploying, that works correctly, so I'm going to deploy to prod and do the same end to end test [13:05:27] !log urbanecm@deploy2002 urbanecm and anzx: Backport for [[gerrit:910018|Disable wmgNewUserMessageOnAutoCreate from Extension:NewUserMessage on knwikisource (T335090)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:05:47] (03CR) 10Ottomata: "Nice! comments inline." [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: 10Joal) [13:05:59] (03CR) 10Clément Goubert: [C: 03+2] push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/911288 (https://phabricator.wikimedia.org/T334061) (owner: 10Clément Goubert) [13:06:18] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:06:19] aanzx: can you test your change at mwdebug1001? 
[13:06:43] Ok [13:07:13] (03PS1) 10MVernon: swift: storage schema for larger disks_by_path backends [puppet] - 10https://gerrit.wikimedia.org/r/911290 (https://phabricator.wikimedia.org/T335275) [13:08:08] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/911290 (https://phabricator.wikimedia.org/T335275) (owner: 10MVernon) [13:09:28] urbanecm: looks fine [13:09:31] (03CR) 10Jelto: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/911285 (owner: 10EoghanGaffney) [13:09:32] thanks, deploying [13:10:32] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:10:36] RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.212 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:11:18] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:25] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:45] (03Merged) 10jenkins-bot: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/911288 (https://phabricator.wikimedia.org/T334061) (owner: 10Clément Goubert) [13:12:55] (03CR) 10EoghanGaffney: [C: 03+2] Fix warning message for DNS discrepancies [cookbooks] - 10https://gerrit.wikimedia.org/r/911285 (owner: 10EoghanGaffney) [13:13:17] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [13:13:34] !log Deploying push-notifications production for 
switch to mw-api-int - T334061 [13:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:40] T334061: Migrate push-notifications to mw-api-int - https://phabricator.wikimedia.org/T334061 [13:13:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:14:06] (03PS1) 10Slyngshede: C:idm::deployment fix users DN. [puppet] - 10https://gerrit.wikimedia.org/r/911293 [13:14:26] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [13:14:55] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [13:15:11] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:910018|Disable wmgNewUserMessageOnAutoCreate from Extension:NewUserMessage on knwikisource (T335090)]] (duration: 11m 02s) [13:15:16] T335090: Disable wmgNewUserMessageOnAutoCreate from Extension:NewUserMessage on knwikisource - https://phabricator.wikimedia.org/T335090 [13:15:17] aanzx: should be live! [13:15:19] anything else? [13:15:20] (03Merged) 10jenkins-bot: Fix warning message for DNS discrepancies [cookbooks] - 10https://gerrit.wikimedia.org/r/911285 (owner: 10EoghanGaffney) [13:15:30] urbanecm: thanks [13:15:35] no problem [13:15:54] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment fix users DN. [puppet] - 10https://gerrit.wikimedia.org/r/911293 (owner: 10Slyngshede) [13:18:36] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/911290 (https://phabricator.wikimedia.org/T335275) (owner: 10MVernon) [13:18:56] (03CR) 10Slyngshede: [C: 03+2] SSH Keymanagement, allow user to manage ssh keys. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/899519 (owner: 10Slyngshede) [13:19:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [13:19:05] nemo-yiannis: Only deployed on eqiad, I'm waiting for the timeout to hit so I can see if it's routing to the right api endpoint [13:19:13] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:19:14] sounds good [13:19:46] (03Merged) 10jenkins-bot: Update InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [13:20:00] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:910723|Update InterwikiSortOrders (T335019)]] [13:20:04] (03PS1) 10Stevemunene: Add a postgresql database and user for airflow_analytics_product [puppet] - 10https://gerrit.wikimedia.org/r/911296 (https://phabricator.wikimedia.org/T333000) [13:20:05] T335019: Post-creation work for fatwiki - https://phabricator.wikimedia.org/T335019 [13:20:38] (03CR) 10Muehlenhoff: [C: 03+2] No longer use mirrors.debian.org on Stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/911280 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [13:22:11] nemo-yiannis: looks good to me in eqiad, I got the hits, deploying codfw and then doing end to end test [13:22:20] 👍 [13:22:39] Did you get the same 200 error ? 
[13:22:48] yep [13:23:17] And the same "Persisting session for unknown reason" on the mw api side [13:23:34] with UA PushNotifications/WMF [13:23:56] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [13:24:23] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [13:24:32] !log eoghan@cumin1001 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1003.wikimedia.org to gitlab1004.wikimedia.org [13:25:16] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 (owner: 10Slyngshede) [13:25:34] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/900400 (https://phabricator.wikimedia.org/T316323) (owner: 10FNegri) [13:25:45] (03CR) 10David Caro: [V: 03+1 C: 03+1] "Tested by fnegri on toolsbeta" [puppet] - 10https://gerrit.wikimedia.org/r/900400 (https://phabricator.wikimedia.org/T316323) (owner: 10FNegri) [13:26:50] (03CR) 10FNegri: [C: 03+2] [tbs.harbor] Remove duplicate pwd [puppet] - 10https://gerrit.wikimedia.org/r/900400 (https://phabricator.wikimedia.org/T316323) (owner: 10FNegri) [13:26:59] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:910723|Update InterwikiSortOrders (T335019)]] (duration: 06m 59s) [13:27:05] T335019: Post-creation work for fatwiki - https://phabricator.wikimedia.org/T335019 [13:27:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:28:02] (03PS1) 10Muehlenhoff: Don't add debian-debug for Stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/911297 (https://phabricator.wikimedia.org/T335282) [13:28:09] * urbanecm done [13:29:12] nemo-yiannis: Getting hits from push notification 
useragent, and just got my test end to end push [13:29:14] looks good [13:30:02] (03CR) 10FNegri: [C: 03+1] "👍🏻" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/909991 (owner: 10David Caro) [13:30:17] (03CR) 10David Caro: [C: 03+2] build_deb: use wikimedia images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/909991 (owner: 10David Caro) [13:31:11] (03Merged) 10jenkins-bot: build_deb: use wikimedia images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/909991 (owner: 10David Caro) [13:31:25] (03PS1) 10Slyngshede: C:idm:deployment Enable SSH keymanagement. [puppet] - 10https://gerrit.wikimedia.org/r/911298 [13:31:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:31:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/911297 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [13:32:09] !log installing libxml2 security updates on bullseye [13:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:31] !log Deployed push-notifications production for switch to mw-api-int - T334061 [13:32:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:35] T334061: Migrate push-notifications to mw-api-int - https://phabricator.wikimedia.org/T334061 [13:33:54] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 
(10Clement_Goubert) [13:34:07] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [13:34:13] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:39:56] (03PS3) 10JMeybohm: Install flink operator in wikikube staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [13:40:36] !log repooling wdqs1005 [13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:21] (03PS5) 10Clément Goubert: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/908955 (https://phabricator.wikimedia.org/T334064) [13:49:17] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:52:42] (03PS1) 10Muehlenhoff: builder: Update readme [puppet] - 10https://gerrit.wikimedia.org/r/911303 [13:55:49] (03CR) 10Joal: "Thanks for the review @otto - patch following" [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: 10Joal) [13:56:09] (03PS3) 10Joal: Refactor dumps::web::fetches::analytics::job [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) [13:59:08] (03PS8) 10Cmelo: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 
(https://phabricator.wikimedia.org/T334088) [14:07:09] !log beginning alert host failover from alert2001 to alert1001 T333837 [14:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:16] T333837: failover alert2001 to alert1001 - https://phabricator.wikimedia.org/T333837 [14:07:43] !log disabled icinga meta monitoring on wikitech-static T333837 [14:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:12] (03PS1) 10Stang: Close cnwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911308 (https://phabricator.wikimedia.org/T274083) [14:09:16] (03PS1) 10Herron: Revert "alerting_host: failover icinga and alertmanger from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/910878 (https://phabricator.wikimedia.org/T333837) [14:09:45] (03PS1) 10Herron: Revert "dns: repoint alert host services to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/910879 (https://phabricator.wikimedia.org/T333837) [14:10:10] (03PS2) 10Herron: Revert "dns: repoint alert host services to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/910879 (https://phabricator.wikimedia.org/T333837) [14:11:29] (03CR) 10Ottomata: Refactor dumps::web::fetches::analytics::job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: 10Joal) [14:12:06] (03CR) 10Herron: [C: 03+2] Revert "alerting_host: failover icinga and alertmanger from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/910878 (https://phabricator.wikimedia.org/T333837) (owner: 10Herron) [14:13:45] (03CR) 10Michael Große: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/911311 (https://phabricator.wikimedia.org/T327062) (owner: 10Michael Große) [14:14:18] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:16:00] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [14:16:18] (03PS1) 10JMeybohm: admin_ng: Add .Values.chartVersions to helmfile example [deployment-charts] - 10https://gerrit.wikimedia.org/r/911314 [14:16:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/911297 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [14:17:00] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [14:17:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [14:19:56] (03CR) 10Herron: [C: 03+2] Revert "dns: repoint alert host services to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/910879 (https://phabricator.wikimedia.org/T333837) (owner: 10Herron) [14:20:54] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:22:24] (03CR) 10EoghanGaffney: [C: 03+2] Switch gitlab-replica and gitlab-replica-old hosts [puppet] - 10https://gerrit.wikimedia.org/r/909244 (https://phabricator.wikimedia.org/T334838) (owner: 10EoghanGaffney) [14:23:52] (03CR) 10JMeybohm: [C: 04-1] Install flink operator in wikikube staging-eqiad (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904226 
(https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [14:27:12] (03CR) 10Michael Große: Beta-Wikidata: Enable Labels in Wikidata edit summaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911311 (https://phabricator.wikimedia.org/T327062) (owner: 10Michael Große) [14:27:49] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Add .Values.chartVersions to helmfile example [deployment-charts] - 10https://gerrit.wikimedia.org/r/911314 (owner: 10JMeybohm) [14:29:53] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (10Jclark-ctr) Rebalanced power [14:30:13] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (10Jclark-ctr) 05Open→03Resolved [14:30:34] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:43] !log re-enabled icinga meta monitoring on wikitech-static T333837 [14:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:49] T333837: failover alert2001 to alert1001 - https://phabricator.wikimedia.org/T333837 [14:33:18] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations: Unduplicate beta cluster hiera keys set both in Horizon and in ops/puppet - https://phabricator.wikimedia.org/T277680 (10joanna_borun) [14:34:56] (03Merged) 10jenkins-bot: admin_ng: Add .Values.chartVersions to helmfile example [deployment-charts] - 10https://gerrit.wikimedia.org/r/911314 (owner: 10JMeybohm) [14:35:03] (JobUnavailable) firing: (2) Reduced availability for job icinga-am in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:35:46] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335294 (10phaultfinder) [14:35:48] 
10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (10phaultfinder) [14:35:50] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [14:36:09] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: mailman3: First/Last name should not be mandatory fields - https://phabricator.wikimedia.org/T312020 (10MarcoAurelio) [[ https://gitlab.com/mailman/django-mailman3/-/commit/1375dcda3328125baab2707a42b10587af893127 | Fixed upstream ]]. Not in any release yet as fa... [14:38:36] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: mailman3: First/Last name should not be mandatory fields - https://phabricator.wikimedia.org/T312020 (10MarcoAurelio) 05Open→03Resolved I'll close this task since the fix upstream was merged and there's nothing AFAIK left to do here. Once this gets into a tag... [14:41:10] (03PS1) 10Stevemunene: Dummy db for new product analytics airflow [labs/private] - 10https://gerrit.wikimedia.org/r/911319 (https://phabricator.wikimedia.org/T333000) [14:42:13] 10Puppet, 10Infrastructure-Foundations, 10Project-Admins, 10PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (10joanna_borun) a:03joanna_borun [14:43:42] (03PS2) 10Dzahn: Add btm to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/910538 (https://phabricator.wikimedia.org/T335216) (owner: 10Gerrit maintenance bot) [14:46:23] (03CR) 10Dzahn: [C: 03+2] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wiktionary_Mandailing" [dns] - 10https://gerrit.wikimedia.org/r/910538 (https://phabricator.wikimedia.org/T335216) (owner: 10Gerrit maintenance bot) [14:47:28] !log DNS - new project language "btm" added - Mandailing language is spoken in Indonesia - https://en.wikipedia.org/wiki/Mandailing_language [14:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:17] (03PS3) 10EoghanGaffney: Move DNS names for gitlab-replica{,-old} [dns] - 10https://gerrit.wikimedia.org/r/909248 
(https://phabricator.wikimedia.org/T334838) [14:50:02] (JobUnavailable) firing: (2) Reduced availability for job icinga-am in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:52:59] (03CR) 10EoghanGaffney: [C: 03+2] Move DNS names for gitlab-replica{,-old} [dns] - 10https://gerrit.wikimedia.org/r/909248 (https://phabricator.wikimedia.org/T334838) (owner: 10EoghanGaffney) [14:55:46] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335298 (10phaultfinder) [14:55:48] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10phaultfinder) [14:55:50] 10Puppet, 10Cloud-VPS, 10Data-Persistence, 10Infrastructure-Foundations, and 3 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10jbond) 05Open→03Stalled [14:55:57] 10Puppet, 10Cloud-VPS, 10Data-Persistence, 10Infrastructure-Foundations, and 3 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10jbond) p:05Triage→03Medium [14:56:52] !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [14:56:55] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [14:58:00] !log bking@wdqs1015 repool wdqs1015 as lag is back down [14:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:16] !log eoghan@cumin1001 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab1003.wikimedia.org to gitlab1004.wikimedia.org [15:05:02] (JobUnavailable) resolved: (2) Reduced availability for job icinga-am in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:49] !log robh@cumin1001 START - Cookbook sre.dns.netbox [15:07:51] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: old cp server work - robh@cumin1001" [15:08:32] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF) [15:08:41] !log restarting haproxy on cp3064 - T334448 [15:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:47] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448 [15:09:07] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: old cp server work - robh@cumin1001" [15:09:07] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:09:13] !log robh@cumin1001 START - Cookbook sre.dns.netbox [15:09:26] bleh forgot to set to planned for prod dns [15:11:36] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: old cp server work - robh@cumin1001" [15:13:03] (03CR) 10Muehlenhoff: [C: 03+2] builder: Update readme [puppet] - 10https://gerrit.wikimedia.org/r/911303 (owner: 10Muehlenhoff) [15:14:28] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: old cp server work - robh@cumin1001" [15:14:28] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:06] 10SRE, 10API Platform, 10Anti-Harassment, 
10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10EBernhardson) [15:15:14] 10SRE, 10Content-Transform-Team-WIP, 10RESTBase, 10Traffic, and 5 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (10Jgiannelos) a:03Jgiannelos [15:20:12] (03PS1) 10Muehlenhoff: Remove obsolete nodejs images only used on Stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911329 (https://phabricator.wikimedia.org/T335282) [15:24:02] (03CR) 10Muehlenhoff: [C: 03+2] Don't add debian-debug for Stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/911297 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [15:25:08] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ssh-gitlab.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:14] (03CR) 10Joal: "Patch on its way!" [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) (owner: 10Joal) [15:26:48] .36 [15:27:17] (03PS4) 10Joal: Refactor dumps::web::fetches::analytics::job [puppet] - 10https://gerrit.wikimedia.org/r/910761 (https://phabricator.wikimedia.org/T317167) [15:27:52] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10LSobanski) [15:28:29] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10bking) [15:30:04] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T1530). 
[15:34:46] (03PS1) 10Muehlenhoff: profile::toolforge::docker::image_builder: No longer use docker::baseimages [puppet] - 10https://gerrit.wikimedia.org/r/911331 (https://phabricator.wikimedia.org/T335282) [15:36:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:26] 10SRE, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 (10Jhancock.wm) [15:38:07] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Jhancock.wm) [15:41:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:41:17] (03CR) 10Jelto: sre: update planned quarters and tickets for collab services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908644 (owner: 10Dzahn) [15:47:08] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove obsolete nodejs images only used on Stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911329 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [15:47:36] (03CR) 10JHathaway: [C: 03+1] "Looks good to me, are there any concerns with both the new and the old mailman server having write access at the same time?" 
[puppet] - 10https://gerrit.wikimedia.org/r/910598 (https://phabricator.wikimedia.org/T331706) (owner: 10Ladsgroup) [15:50:26] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:54] 10Puppet, 10Beta-Cluster-Infrastructure: Unduplicate beta cluster hiera keys set both in Horizon and in ops/puppet - https://phabricator.wikimedia.org/T277680 (10Aklapper) @joanna_borun: Herald added back the project tag due to H389 / T285143. If you'd like that changed, please file a separate ticket - thanks! [16:03:53] 10Puppet, 10Infrastructure-Foundations: systemd-timer puppet code triggers an execution when applying a schedule change - https://phabricator.wikimedia.org/T329158 (10jbond) p:05Triage→03Medium I have been unable to recreate this, the only thing that is called is `systemctl daemon-reload` however this does... [16:05:43] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Beta-Wikidata: Enable Labels in Wikidata edit summaries (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911311 (https://phabricator.wikimedia.org/T327062) (owner: 10Michael Große) [16:15:24] (03CR) 10BryanDavis: toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909397 (https://phabricator.wikimedia.org/T334925) (owner: 10BryanDavis) [16:18:56] 10SRE, 10Keyholder, 10VPS-project-Codesearch, 10serviceops, 10Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (10BCornwall) [16:28:41] (03PS1) 10Jbond: dumps::distribution::ferm: update to resolve hosts in puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) [16:31:12] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, and 2 others: clouddumps1002: ferm is being started on every 
puppet run - https://phabricator.wikimedia.org/T323324 (10jbond) Im unable to recreate this did you fix it. either way i think this would be more strict if you pushed the DNS resol... [16:34:25] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp5013 [16:35:08] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5013 [16:35:20] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp5014 [16:36:00] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5014 [16:36:07] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp5015 [16:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:37:10] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5015 [16:37:18] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp5016 [16:39:02] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5016 [16:49:18] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:51:52] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10wiki_willy) a:05wiki_willy→03RobH @RobH - can you work with @fgiunchedi on this? This ties back to T310266, when the alert was first rolled out. But if you're able to ssh in and it continues to alert, I'm thinking maybe there's a... 
[16:51:56] PROBLEM - IPMI Sensor Status on mw2432 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:54:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:55:29] (03CR) 10Dzahn: "I don't dislike the idea to send the DNS lookups to the puppetmaster but I also think it might be best to simply keep IP addresses in Hier" [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [16:59:49] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (10akosiaris) > We could consider granting access for the switches to our own Docker registry for ease of management. Read or write? We probabl... [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T1700) [17:00:06] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T1700). [17:02:24] (03CR) 10Dzahn: dumps::distribution::ferm: update to resolve hosts in puppetmaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [17:02:36] Anyone seeing test failures with messages like "fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/GeoData/': The requested URL returned error: 502" [17:02:58] It's happening for a CheckUser change in gate-and-submit [17:03:06] Wondering if it's known or not [17:03:38] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:04:32] Dreamy_Jazz: did it happen temporarily in the past? 
that URL is a 404 for me, not a 502 [17:04:44] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp5013.mgmt.eqsin.wmnet with reboot policy FORCED [17:04:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:05:05] Hmm. It happened as part of the zuul clone [17:05:49] so..that is a 404 [17:05:56] Same with https://gerrit.wikimedia.org/r/mediawiki/extensions/ProofreadPage/ [17:05:59] my guess is now this moved from gerrit to gitlab maybe [17:06:06] fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/ProofreadPage/': The requested URL returned error: 502' [17:06:09] let's talk about this in -releng [17:06:21] I think they are moving repos to gitlab [17:06:33] Thanks [17:06:36] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Jhancock.wm) [17:19:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp5013.mgmt.eqsin.wmnet with reboot policy FORCED [17:20:02] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp5014.mgmt.eqsin.wmnet with reboot policy FORCED [17:26:14] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp5014.mgmt.eqsin.wmnet with reboot policy FORCED [17:26:59] (03PS1) 10Jbond: rake_modules: add check for namespaces hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/911346 (https://phabricator.wikimedia.org/T209265) [17:28:17] for the record, had nothing to do with moving repos. 404 is normal for that type of URL [17:30:23] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, and 2 others: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Dzahn) I can confirm the issue what made me create this ticket is gone. So it's resolved. I don't know how it got resolved... 
[17:31:45] (03PS2) 10Jbond: rake_modules: add check for namespaces hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/911346 (https://phabricator.wikimedia.org/T209265) [17:32:14] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/911346 (https://phabricator.wikimedia.org/T209265) (owner: 10Jbond) [17:32:29] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp5015.mgmt.eqsin.wmnet with reboot policy FORCED [17:32:49] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, and 2 others: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Dzahn) So, now the status is: ` [clouddumps1002:~] $ host ftp.acc.umu.se ftp.acc.umu.se has address 194.71.11.163 ftp.acc.... [17:34:17] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, and 2 others: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Dzahn) a:05Dzahn→03None [17:34:43] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, and 2 others: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Dzahn) 05Open→03Resolved a:03Dzahn [17:35:59] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp5015.mgmt.eqsin.wmnet with reboot policy FORCED [17:36:22] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp5013.mgmt.eqsin.wmnet with reboot policy FORCED [17:42:01] (03PS3) 10Dzahn: sre: update planned quarters and tickets for collab services [puppet] - 10https://gerrit.wikimedia.org/r/908644 (https://phabricator.wikimedia.org/T327068) [17:43:52] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp5013.mgmt.eqsin.wmnet with reboot policy FORCED [17:44:21] (03CR) 10Dzahn: "Nicer commit message, resolved inline comment / typo fix, adjusted quarters per our chat in team meeting today." 
[puppet] - 10https://gerrit.wikimedia.org/r/908644 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn) [17:47:42] (03CR) 10Dzahn: [C: 03+2] "@Muehlenhoff FYI, our team delivered a plan now" [puppet] - 10https://gerrit.wikimedia.org/r/908644 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn) [17:50:30] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:56:10] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:56:45] (03PS1) 10Krinkle: mediawiki: Add auto_prepend_file to PHP config_cli (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) [17:58:56] (03CR) 10Dzahn: "Let me try to amend this change in a way that we can apply it on the new machine, gerrit1003, without changing the existing prod server, g" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [17:59:12] (03CR) 10CI reject: [V: 04-1] mediawiki: Add auto_prepend_file to PHP config_cli (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [18:03:44] (03PS2) 10Krinkle: mediawiki: Add auto_prepend_file to PHP config_cli (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) [18:04:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2432:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2432 - 
https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:05:47] (03CR) 10CI reject: [V: 04-1] mediawiki: Add auto_prepend_file to PHP config_cli (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [18:06:19] (03PS3) 10Krinkle: mediawiki: Add auto_prepend_file to PHP config_cli (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) [18:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:21:44] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10phaultfinder) [18:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:34:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on db2185:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2185 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:36:04] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (10phaultfinder) [18:36:06] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [18:44:20] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [18:55:59] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10phaultfinder) [18:56:46] 
PROBLEM - Query Service HTTP Port on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:56:52] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:56:52] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:57:00] PROBLEM - Check systemd state on wdqs2009 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service,wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:06] PROBLEM - puppet last run on wdqs2009 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:57:28] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 414 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:58:09] ^^ should clear soon, sorry for the wdqs spam [18:58:30] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:58:30] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2009 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:58:38] RECOVERY - Check systemd state on wdqs2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:08] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.212 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:00:02] RECOVERY - Query Service HTTP Port on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [19:00:06] eoghan and brennen: Dear deployers, time to do the Phabricator (Aphlict) update window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T1900). [19:01:07] (03PS1) 10Nray: Fix InvalidCharacterError: Failed to execute 'add' on 'DOMTokenList' [skins/Vector] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/911366 (https://phabricator.wikimedia.org/T335149) [19:01:15] o/ - we'll commence before long. 
[19:02:07] (03PS1) 10Gergő Tisza: [beta] Reenable Graph on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911351 (https://phabricator.wikimedia.org/T334895) [19:02:40] RECOVERY - puppet last run on wdqs2009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:02:51] (03PS2) 10Gergő Tisza: [beta] Reenable Graph on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911351 (https://phabricator.wikimedia.org/T334895) [19:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:05:48] (03CR) 10Jdlrobson: [C: 03+1] [beta] Reenable Graph on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911351 (https://phabricator.wikimedia.org/T334895) (owner: 10Gergő Tisza) [19:07:53] (03PS1) 10EoghanGaffney: Move aphlict.discovery.wmnet over to aphlict1002 [dns] - 10https://gerrit.wikimedia.org/r/911352 (https://phabricator.wikimedia.org/T333452) [19:08:18] I'll deploy a beta-only change. 
[19:09:28] (03CR) 10Gergő Tisza: [C: 03+2] [beta] Reenable Graph on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911351 (https://phabricator.wikimedia.org/T334895) (owner: 10Gergő Tisza) [19:10:24] (03Merged) 10jenkins-bot: [beta] Reenable Graph on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911351 (https://phabricator.wikimedia.org/T334895) (owner: 10Gergő Tisza) [19:19:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:23:47] (03CR) 10RLazarus: [C: 03+1] Move aphlict.discovery.wmnet over to aphlict1002 [dns] - 10https://gerrit.wikimedia.org/r/911352 (https://phabricator.wikimedia.org/T333452) (owner: 10EoghanGaffney) [19:23:50] (03CR) 10Dzahn: [C: 03+1] Move aphlict.discovery.wmnet over to aphlict1002 [dns] - 10https://gerrit.wikimedia.org/r/911352 (https://phabricator.wikimedia.org/T333452) (owner: 10EoghanGaffney) [19:27:09] (03CR) 10EoghanGaffney: [C: 03+2] Move aphlict.discovery.wmnet over to aphlict1002 [dns] - 10https://gerrit.wikimedia.org/r/911352 (https://phabricator.wikimedia.org/T333452) (owner: 10EoghanGaffney) [19:29:08] !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache aphlict.discovery.wmnet on all recursors [19:29:11] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aphlict.discovery.wmnet on all recursors [19:33:41] (03PS1) 10EoghanGaffney: Add aphlict service to new vm [puppet] - 10https://gerrit.wikimedia.org/r/911357 (https://phabricator.wikimedia.org/T333452) [19:34:10] (03CR) 10Dzahn: [C: 03+1] Add aphlict service to new vm [puppet] - 10https://gerrit.wikimedia.org/r/911357 (https://phabricator.wikimedia.org/T333452) (owner: 
10EoghanGaffney) [19:34:40] (03CR) 10EoghanGaffney: [C: 03+2] Add aphlict service to new vm [puppet] - 10https://gerrit.wikimedia.org/r/911357 (https://phabricator.wikimedia.org/T333452) (owner: 10EoghanGaffney) [19:34:52] (03PS1) 10Dzahn: gerrit: add /srv/gerrit/data/lfs to dirs managed by puppet [puppet] - 10https://gerrit.wikimedia.org/r/911358 (https://phabricator.wikimedia.org/T333143) [19:35:42] (03CR) 10Dzahn: [C: 03+2] "Just adding the directory, not changing app config." [puppet] - 10https://gerrit.wikimedia.org/r/911358 (https://phabricator.wikimedia.org/T333143) (owner: 10Dzahn) [19:39:55] (03PS11) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [19:41:06] PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: aphlict.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:19] That's me, downtiming now [19:42:13] thanks! 
and for working on that [19:42:49] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on aphlict1001.eqiad.wmnet with reason: aphlict1002 is now active for testing [19:45:29] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on aphlict1001.eqiad.wmnet with reason: aphlict1002 is now active for testing [19:47:39] (03PS2) 10MusikAnimal: interwiki: update URL to XTools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910110 [19:55:32] RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:55] (03CR) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [19:56:53] (03CR) 10BCornwall: [V: 03+1 C: 03+2] keyholder-proxy: systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895885 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T2000). Please do the needful. [20:00:04] Superpes and nray: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:13] Hi :) [20:00:16] Hello o/ [20:00:46] (03CR) 10Samtar: [C: 03+1] "✔" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910110 (owner: 10MusikAnimal) [20:01:09] (03PS1) 10Dzahn: backup: add /srv/gerrit/data to fileset for gerrit repos [puppet] - 10https://gerrit.wikimedia.org/r/911362 (https://phabricator.wikimedia.org/T333143) [20:03:43] i can deploy o/ [20:03:57] \o/ cjming saves the day again! [20:04:07] lol [20:04:13] (03CR) 10Dzahn: [C: 03+2] backup: add /srv/gerrit/data to fileset for gerrit repos [puppet] - 10https://gerrit.wikimedia.org/r/911362 (https://phabricator.wikimedia.org/T333143) (owner: 10Dzahn) [20:04:33] (03CR) 10Clare Ming: [C: 03+2] Fix InvalidCharacterError: Failed to execute 'add' on 'DOMTokenList' [skins/Vector] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/911366 (https://phabricator.wikimedia.org/T335149) (owner: 10Nray) [20:04:39] Lol :D [20:05:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910603 (https://phabricator.wikimedia.org/T335162) (owner: 10Superpes15) [20:11:10] (03Merged) 10jenkins-bot: [kcgwiktionary] Add a HD logo for vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910603 (https://phabricator.wikimedia.org/T335162) (owner: 10Superpes15) [20:11:26] !log cjming@deploy2002 Started scap: Backport for [[gerrit:910603|[kcgwiktionary] Add a HD logo for vector legacy (T335162)]] [20:11:31] T335162: Set logo for kcgwiktionary and guwwikinews - https://phabricator.wikimedia.org/T335162 [20:13:01] !log cjming@deploy2002 superpes and cjming: Backport for [[gerrit:910603|[kcgwiktionary] Add a HD logo for vector legacy (T335162)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:13:05] hi Superpes: your 1st patch can be tested on any debug server [20:13:08] Looking :) [20:13:32] Looks fine! 
cjming Thanks :) [20:13:39] great - syncing [20:14:08] (03PS4) 10Clare Ming: [guwwikinews] Add a HD logo for vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910604 (https://phabricator.wikimedia.org/T335162) (owner: 10Superpes15) [20:19:17] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:910603|[kcgwiktionary] Add a HD logo for vector legacy (T335162)]] (duration: 07m 51s) [20:19:23] T335162: Set logo for kcgwiktionary and guwwikinews - https://phabricator.wikimedia.org/T335162 [20:19:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:19:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910604 (https://phabricator.wikimedia.org/T335162) (owner: 10Superpes15) [20:20:54] (03Merged) 10jenkins-bot: [guwwikinews] Add a HD logo for vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910604 (https://phabricator.wikimedia.org/T335162) (owner: 10Superpes15) [20:21:07] !log cjming@deploy2002 Started scap: Backport for [[gerrit:910604|[guwwikinews] Add a HD logo for vector legacy (T335162)]] [20:22:18] !log cjming@deploy2002 superpes and cjming: Backport for [[gerrit:910604|[guwwikinews] Add a HD logo for vector legacy (T335162)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:22:22] Superpes: 1st patch is live - 2nd patch is ready to be tested [20:22:27] Testing! [20:23:01] Also the 2nd is fine cjming thanks! 
[20:23:08] cool - syncing [20:23:34] (03PS2) 10Clare Ming: [fywiki] Add portal and portal talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910857 (https://phabricator.wikimedia.org/T334807) (owner: 10Superpes15) [20:24:22] (03Merged) 10jenkins-bot: Fix InvalidCharacterError: Failed to execute 'add' on 'DOMTokenList' [skins/Vector] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/911366 (https://phabricator.wikimedia.org/T335149) (owner: 10Nray) [20:24:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:28:12] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:29] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:910604|[guwwikinews] Add a HD logo for vector legacy (T335162)]] (duration: 07m 22s) [20:28:35] T335162: Set logo for kcgwiktionary and guwwikinews - https://phabricator.wikimedia.org/T335162 [20:28:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910857 (https://phabricator.wikimedia.org/T334807) (owner: 10Superpes15) [20:29:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:29:38] (03Merged) 10jenkins-bot: [fywiki] Add portal and portal talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910857 (https://phabricator.wikimedia.org/T334807) 
(owner: 10Superpes15) [20:29:49] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:29:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:30:51] !log cjming@deploy2002 Started scap: Backport for [[gerrit:910857|[fywiki] Add portal and portal talk namespace (T334807)]] [20:30:57] T334807: Add Portal namespace on West Frisian Wikipedia - https://phabricator.wikimedia.org/T334807 [20:31:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:32:02] (03CR) 10Andrea Denisse: "Hello, these are the PCC results of the latest patch: https://puppet-compiler.wmflabs.org/output/909738/40808/" [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [20:32:04] !log cjming@deploy2002 cjming and superpes: Backport for [[gerrit:910857|[fywiki] Add portal and portal talk namespace (T334807)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:32:07] Superpes: 2nd patch is live - is 3rd testable? [20:32:26] Yep! Just testing if the alias works! [20:32:41] Fine! Thanks cjming :) [20:32:51] syncing! [20:33:09] Oh don't forget to run NamespaceDupes.php after the deploy :) [20:33:33] how is that run? on the maintenance server? 
[20:33:38] cjming: yep [20:33:39] (03PS1) 10Dzahn: gerrit: make the lfs data path configurable [puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) [20:33:53] https://www.mediawiki.org/wiki/Manual:NamespaceDupes.php Yep [20:33:56] mwscript namespaceDupes.php in mwmaint [20:34:03] (03CR) 10CI reject: [V: 04-1] gerrit: make the lfs data path configurable [puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: 10Dzahn) [20:34:26] Ciao herzog! :) [20:34:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:34:45] Ciao don Superpes :) [20:35:04] Lol "don" [20:35:17] (03PS2) 10Dzahn: gerrit: make the lfs data path configurable [puppet] - 10https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) [20:35:55] herzog: shouldn't it be Gran duque? 
[20:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:36:53] mutante: I don't think ß is allowed in IRC nicks [20:37:24] (CR) CI reject: [V: -1] gerrit: make the lfs data path configurable [puppet] - https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: Dzahn) [20:37:54] grossherzog: lol [20:38:00] Rotfl [20:38:18] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:910857|[fywiki] Add portal and portal talk namespace (T334807)]] (duration: 07m 26s) [20:38:24] T334807: Add Portal namespace on West Frisian Wikipedia - https://phabricator.wikimedia.org/T334807 [20:40:43] herzog: thanks -- what is the literal cmd if you happen to know? i'm in mwmaint [20:41:19] !log cjming@deploy2002 Started scap: Backport for [[gerrit:911366|Fix InvalidCharacterError: Failed to execute 'add' on 'DOMTokenList' (T335149)]] [20:41:25] T335149: InvalidCharacterError: Failed to execute 'add' on 'DOMTokenList': The token provided ('mw-ui-button mw-ui-quiet mw-ui-icon mw-ui-icon-element mw-ui-icon-bell') contains HTML space characters, which are not valid in tokens. 
- https://phabricator.wikimedia.org/T335149 [20:41:39] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@6e76561]: (no justification provided) [20:41:49] hi nray: deploying yours now [20:41:57] cjming: sounds good, thank you [20:42:02] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@6e76561]: (no justification provided) (duration: 00m 23s) [20:42:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:33] cjming Should be mwscript maintenance/namespaceDupes.php --wiki fywiki --fix (I suppose) [20:42:34] !log cjming@deploy2002 cjming and nray: Backport for [[gerrit:911366|Fix InvalidCharacterError: Failed to execute 'add' on 'DOMTokenList' (T335149)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:42:59] nray: shall i sync? [20:43:04] cjming: you may run mwscript namespaceDupes.php --wiki fywiki for a dry-run first [20:43:36] cjming: let me check the testservers [20:43:53] nray: sounds good - i'll await your green light [20:44:07] Superpes: ran the maintenance script - should be all good [20:44:48] Wonderful! 
And thanks for your time :D [20:45:02] thanks for your patience :) [20:45:32] (PS3) Dzahn: gerrit: make the lfs data path configurable [puppet] - https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) [20:46:52] cjming: looks good, you can proceed [20:47:03] great - syncing [20:47:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:47:38] (CR) CI reject: [V: -1] gerrit: make the lfs data path configurable [puppet] - https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: Dzahn) [20:50:25] (CR) Dzahn: [C: +1] "lgtm. it does create all the timer resources but if you look at change catalog you can see they are "stopped" because "auto_sync" is set t" [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [20:52:45] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:911366|Fix InvalidCharacterError: Failed to execute 'add' on 'DOMTokenList' (T335149)]] (duration: 11m 25s) [20:52:50] T335149: InvalidCharacterError: Failed to execute 'add' on 'DOMTokenList': The token provided ('mw-ui-button mw-ui-quiet mw-ui-icon mw-ui-icon-element mw-ui-icon-bell') contains HTML space characters, which are not valid in tokens. - https://phabricator.wikimedia.org/T335149 [20:52:53] nray: should be live [20:53:06] (PS4) Dzahn: gerrit: make the lfs data path configurable [puppet] - https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) [20:53:09] cjming: great thank you! 
[20:53:28] ur welcome :) [20:53:49] !log end of UTC late backport window [20:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:09] (CR) CI reject: [V: -1] gerrit: make the lfs data path configurable [puppet] - https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: Dzahn) [20:59:14] (CR) Dzahn: "@jbond jenkins-bot says -1 because it can't find a value for the lookup. is that my mistake or ..?" [puppet] - https://gerrit.wikimedia.org/r/911363 (https://phabricator.wikimedia.org/T333143) (owner: Dzahn) [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T2100). [21:01:24] (CR) Herron: prometheus: Add support for syncing data between Prometheus hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [21:13:39] (CR) Dzahn: "I broke this up into steps" [puppet] - https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: Hashar) [21:14:49] (PS2) Dzahn: gerrit: relocate LFS data [puppet] - https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: Hashar) [21:16:14] (CR) Dzahn: "rebased but after https://gerrit.wikimedia.org/r/c/operations/puppet/+/911363/ it should change into just a Hiera change" [puppet] - https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: Hashar) [21:18:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:14] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:33:52] (PS1) Mhurd: gitlab runner: allow node:* images [puppet] - https://gerrit.wikimedia.org/r/911407 (https://phabricator.wikimedia.org/T335320) [21:35:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:50:30] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:51:39] (CR) Cory Massaro: "Hello! I'm trying to add an AppArmor profile to our current Kubernetes deployment but running into issues. 
The supplied policy (allow only" [deployment-charts] - https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: Alexandros Kosiaris) [22:04:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2432:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2432 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:04:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:04:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:07:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:07:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:15:14] (CR) Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [22:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:20:46] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder) [22:22:01] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder) [22:25:30] (PowerSupply) firing: (2) Power 
Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:34:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on db2185:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2185 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:40:48] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (phaultfinder) [22:40:51] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder) [23:00:45] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (phaultfinder) [23:01:36] papaul: Are these errors expected? [23:02:59] Ah, I see, RobH seems to have been messing with all that