[00:00:59] (03PS1) 10Cmelo: Remove multi organizers feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909401 (https://phabricator.wikimedia.org/T334088) [00:01:07] (03PS1) 10Andrew Bogott: Openstack env scripts: include OPENSTACK_DOMAIN_ID [puppet] - 10https://gerrit.wikimedia.org/r/909402 [00:16:28] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:18:06] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:28:06] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cassandra-dev2001.codfw.wmnet with reason: testing systemd unit changes — T327954 [00:28:11] T327954: session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors) - https://phabricator.wikimedia.org/T327954 [00:28:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cassandra-dev2001.codfw.wmnet with reason: testing systemd unit changes — T327954 [00:30:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:11] (03CR) 10Gergő Tisza: [C: 03+1] [Growth] Prepare for a Personalized praise config variable change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908365 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [00:34:29] (03CR) 10Gergő Tisza: [C: 03+1] [Growth] Finish Personalized praise variable rename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908367 
(https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [00:36:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:37:18] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/908885 [00:39:32] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/908885 (owner: 10TrainBranchBot) [00:46:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:14] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for cassandra-dev2001.codfw.wmnet [00:54:15] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cassandra-dev2001.codfw.wmnet [00:55:39] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/908885 (owner: 10TrainBranchBot) [01:00:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:14] (03PS1) 10Eevans: Do not de-init node prior to restart [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754) [01:03:37] (03CR) 
10Eevans: [C: 04-1] "Untested; Not yet ready" [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [01:05:58] (03PS7) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - 10https://gerrit.wikimedia.org/r/905243 [01:08:30] (03CR) 10Andrew Bogott: [C: 03+2] Openstack env scripts: include OPENSTACK_DOMAIN_ID [puppet] - 10https://gerrit.wikimedia.org/r/909402 (owner: 10Andrew Bogott) [01:10:20] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [01:10:49] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [01:19:42] (03PS2) 10Eevans: Do not de-init node prior to restart [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754) [01:24:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:36:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:30] (03CR) 10Raymond Ndibe: [C: 03+1] "LGTM. 
I can create an installable build with this on my machine" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 (owner: 10David Caro) [01:45:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:54:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T0200) [02:06:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.5 [core] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/909406 (https://phabricator.wikimedia.org/T330211) [02:07:51] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.5 [core] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/909406 (https://phabricator.wikimedia.org/T330211) (owner: 10TrainBranchBot) [02:10:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:16:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:21:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: 
CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:24:03] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.5 [core] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/909406 (https://phabricator.wikimedia.org/T330211) (owner: 10TrainBranchBot) [02:30:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:42:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:00:04] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T0300) [03:01:21] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909426 (https://phabricator.wikimedia.org/T330211) [03:01:23] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909426 (https://phabricator.wikimedia.org/T330211) (owner: 10TrainBranchBot) [03:02:09] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909426 (https://phabricator.wikimedia.org/T330211) (owner: 10TrainBranchBot) [03:02:34] !log mwpresync@deploy2002 
Started scap: testwikis wikis to 1.41.0-wmf.5 refs T330211 [03:02:39] T330211: 1.41.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T330211 [03:35:02] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:36:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:36:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:51:38] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.5 refs T330211 (duration: 49m 03s) [03:51:43] T330211: 1.41.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T330211 [03:53:48] !log mwpresync@deploy2002 Pruned MediaWiki: 1.41.0-wmf.3 (duration: 02m 08s) [04:13:18] (03PS1) 10Andrew Bogott: cloudceph: add firewall access for cloudbackup2001 [puppet] - 10https://gerrit.wikimedia.org/r/909434 [04:13:55] (03CR) 10Andrew Bogott: [C: 03+2] cloudceph: add firewall access for cloudbackup2001 [puppet] - 10https://gerrit.wikimedia.org/r/909434 (owner: 10Andrew Bogott) [04:31:08] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:00] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:39:40] 
RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:40:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:52:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:53:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:55:12] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:55:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:56:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.497 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:56:38] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:21:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:31:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:54:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T0600) [06:00:04] kormat, marostegui, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T0600). 
[06:01:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 16591 [06:01:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T334821 [06:02:03] T334821: Switchover x2 master db2144 -> db2142) - https://phabricator.wikimedia.org/T334821 [06:02:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 16591 [06:02:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T334821 [06:03:50] (03PS1) 10Marostegui: mariadb: Promote db2144 to x2 master [puppet] - 10https://gerrit.wikimedia.org/r/909482 (https://phabricator.wikimedia.org/T334821) [06:04:20] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2144 to x2 master [puppet] - 10https://gerrit.wikimedia.org/r/909482 (https://phabricator.wikimedia.org/T334821) (owner: 10Marostegui) [06:06:10] !log Starting x2 codfw failover from db2144 to db2142 - T334821 [06:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:11:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2142 to x2 primary T334821', diff saved to https://phabricator.wikimedia.org/P47054 and previous config saved to /var/cache/conftool/dbconfig/20230418-061101-root.json [06:11:06] T334821: Switchover x2 master db2144 -> db2142) - https://phabricator.wikimedia.org/T334821 [06:13:46] (03PS1) 10Marostegui: wmnet: Update x2 CNAME [dns] - 10https://gerrit.wikimedia.org/r/909483 (https://phabricator.wikimedia.org/T334821) [06:14:47] (03CR) 10Marostegui: [C: 03+2] wmnet: Update x2 CNAME [dns] - 10https://gerrit.wikimedia.org/r/909483 
(https://phabricator.wikimedia.org/T334821) (owner: 10Marostegui) [06:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:21:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:48] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:35:08] (03PS1) 10Marostegui: mariadb: Productionize db1219 [puppet] - 10https://gerrit.wikimedia.org/r/909484 (https://phabricator.wikimedia.org/T326669) [06:35:39] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1219 [puppet] - 10https://gerrit.wikimedia.org/r/909484 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:35:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:42:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:45:33] (JobUnavailable) resolved: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:46:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:38] (ProbeDown) resolved: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:47:54] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [06:50:38] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:52:50] (ThanosCompactIsDown) resolved: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [06:54:32] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:57:17] (03PS1) 10Marostegui: db1217: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909548 (https://phabricator.wikimedia.org/T326669) [06:57:46] (03CR) 10Marostegui: [C: 03+2] db1217: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909548 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:59:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1109.eqiad.wmnet [07:00:02] (03PS1) 10Marostegui: mariadb: Decommission db1109 [puppet] - 10https://gerrit.wikimedia.org/r/909550 (https://phabricator.wikimedia.org/T334820) [07:00:06] Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T0700). [07:00:06] No Gerrit patches in the queue for this window AFAICS. 
[07:00:19] 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1109.eqiad.wmnet - https://phabricator.wikimedia.org/T334820 (10Marostegui) [07:03:07] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) [07:03:30] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1109 [puppet] - 10https://gerrit.wikimedia.org/r/909550 (https://phabricator.wikimedia.org/T334820) (owner: 10Marostegui) [07:03:30] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:04:14] 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1109.eqiad.wmnet - https://phabricator.wikimedia.org/T334820 (10Marostegui) [07:05:21] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1109.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:06:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1109.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:06:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:06:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1109.eqiad.wmnet [07:06:38] 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1109.eqiad.wmnet - https://phabricator.wikimedia.org/T334820 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1109.eqiad.wmnet` - db1109.eqiad.wmnet (**WARN**) - Downtimed... 
[07:07:22] 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1109.eqiad.wmnet - https://phabricator.wikimedia.org/T334820 (10Marostegui) a:05Marostegui→03Jclark-ctr [07:07:36] 10ops-eqiad, 10decommission-hardware: decommission db1109.eqiad.wmnet - https://phabricator.wikimedia.org/T334820 (10Marostegui) This is ready for DC-Ops [07:07:47] 10ops-eqiad, 10decommission-hardware: decommission db1109.eqiad.wmnet - https://phabricator.wikimedia.org/T334820 (10Marostegui) [07:10:29] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) [07:14:48] jouncebot: nowandnext [07:14:48] For the next 0 hour(s) and 45 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T0700) [07:14:48] In 0 hour(s) and 45 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T0800) [07:16:25] !log added requestctl rule for T332061 in logging mode [07:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:21] (03PS1) 10Zabe: Disable Graph extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909407 (https://phabricator.wikimedia.org/T334895) [07:18:23] (03CR) 10Zabe: [C: 03+2] Disable Graph extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909407 (https://phabricator.wikimedia.org/T334895) (owner: 10Zabe) [07:18:27] !log zabe@deploy2002 Started scap: T334895 [07:18:28] !log zabe@deploy2002 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki=aawiki --force-version "1.41.0-wmf.4" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.c2xgrltrG8"' returned non-zero exit status 255. 
(duration: 00m 01s) [07:19:08] (03Merged) 10jenkins-bot: Disable Graph extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909407 (https://phabricator.wikimedia.org/T334895) (owner: 10Zabe) [07:19:53] (03PS1) 10Zabe: Only enable Kartographer if Graph is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909408 [07:19:55] (03CR) 10Zabe: [C: 03+2] Only enable Kartographer if Graph is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909408 (owner: 10Zabe) [07:20:05] !log zabe@deploy2002 Started scap: T334895 [07:20:06] !log zabe@deploy2002 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki=aawiki --force-version "1.41.0-wmf.4" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.8ZJFnr01rx"' returned non-zero exit status 255. (duration: 00m 00s) [07:20:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:20:55] (03Merged) 10jenkins-bot: Only enable Kartographer if Graph is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909408 (owner: 10Zabe) [07:24:11] !log zabe@deploy2002 Started scap: T334895 [07:26:54] (03CR) 10Zabe: [C: 03+2] Only enable Kartographer if Graph is enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909408 (owner: 10Zabe) [07:30:48] !log zabe@deploy2002 Finished scap: T334895 (duration: 06m 37s) [07:31:07] (03PS1) 10Zabe: Disable Kartographer aswell [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909409 [07:31:09] (03CR) 10Zabe: [C: 03+2] Disable Kartographer aswell [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909409 (owner: 10Zabe) [07:31:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:31:28] (03CR) 10Zabe: [C: 03+2] "(already deployed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909409 (owner: 10Zabe) [07:31:55] (03CR) 10CI reject: [V: 04-1] Disable Kartographer aswell [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909409 (owner: 10Zabe) [07:33:30] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:34:22] (03PS2) 10Slyngshede: Sphinx: Start work on documentation [software/bitu] - 10https://gerrit.wikimedia.org/r/908769 [07:34:34] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:34:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:35:11] !log zabe@deploy2002 Started scap: T334895 [07:35:24] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics-external_4692: Servers kubernetes1008.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org [07:35:24] al [07:35:30] (03PS2) 10Urbanecm: [Growth] Prepare for a Personalized praise config variable change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908365 (https://phabricator.wikimedia.org/T334630) [07:36:16] (MediaWikiHighErrorRate) resolved: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:36:19] (03CR) 10CI reject: [V: 04-1] [Growth] Prepare for a Personalized praise config variable change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908365 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:36:24] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics-external_4692: Servers kubernetes2020.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled ht [07:36:24] itech.wikimedia.org/wiki/PyBal [07:36:56] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics-external_4692: Servers kubernetes2007.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled ht [07:36:56] itech.wikimedia.org/wiki/PyBal [07:37:14] zabe: https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json&smtype=language&smlangprop=dir%7Ccode%7Csite&smsiteprop=dbname&formatversion=2 looks it might be caused by your merge? 
[07:37:21] (as in, the internal server error) [07:37:48] yes, on it [07:38:21] disabling graph also disabled JsonConfig (which I'm not sure is necessary): https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/CommonSettings.php#3592 [07:38:22] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:38:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:39:20] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:39:32] (JobUnavailable) firing: (2) Reduced availability for job swagger_check_eventgate_analytics_external_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:39:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:39:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:40:14] what's up again? 
[07:40:22] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:40:35] is that me? [07:40:52] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:40:57] 500 errors increasing since 10 minutes ago [07:41:10] zabe: dunno, did you make a change? [07:41:12] that's zabe's sync very likely [07:41:15] https://grafana.wikimedia.org/goto/gcIIt5PVz?orgId=1 [07:41:18] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:41:28] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:41:37] (03CR) 10Zabe: [C: 03+2] "retry (already deployed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909409 (owner: 10Zabe) [07:41:40] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:41:49] timeline matches zabe@deploy2002 Finished scap: T334895 (duration: 06m 37s) [07:41:52] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:41:53] !log zabe@deploy2002 Finished scap: T334895 (duration: 06m 42s) [07:41:58] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:42:20] (03Merged) 10jenkins-bot: Disable Kartographer aswell [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/909409 (owner: 10Zabe) [07:42:38] (03PS1) 10Zabe: Only enable Dashiki when Graph is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909410 [07:42:40] (03CR) 10Zabe: [C: 03+2] Only enable Dashiki when Graph is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909410 (owner: 10Zabe) [07:42:52] (03CR) 10Zabe: [C: 03+2] "(already deployed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909410 (owner: 10Zabe) [07:43:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:43:11] we're back to normal, I think [07:43:20] my req works now [07:43:29] (03Merged) 10jenkins-bot: Only enable Dashiki when Graph is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909410 (owner: 10Zabe) [07:43:31] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908365 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:43:47] zabe: can you fill me in on what your change was? [07:43:57] XioNoX: https://phabricator.wikimedia.org/T334895 [07:44:12] thx [07:44:30] so it's the sync or the change itself that caused the issue?
[07:44:32] (JobUnavailable) firing: (2) Reduced availability for job swagger_check_eventgate_analytics_external_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:44:43] XioNoX: it's unintended consequences on MW's end (disabling more things than desired) [07:44:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:45:43] I disabled an extension which a second extension depended on which did not automatically get disabled [07:46:07] hey [07:46:11] thanks [07:46:14] wth is happening with mw-on-k8s [07:46:17] joe ^ [07:46:25] what's up with analytics now [07:46:51] see _security please [07:46:51] pods are up [07:47:15] !incidents [07:47:15] You're not allowed to perform this action.
[07:47:19] great [07:47:29] !incidents [07:47:30] 3548 (RESOLVED) HaproxyUnavailable cache_text global sre () [07:47:30] 3547 (RESOLVED) HaproxyUnavailable cache_text global sre () [07:47:30] 3546 (RESOLVED) [15x] ProbeDown sre (probes/service) [07:48:27] thx [07:48:50] (03PS1) 10Marostegui: instances.yaml: Add db1212 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/909601 (https://phabricator.wikimedia.org/T326669) [07:48:54] !log cgoubert@deploy2002 Started scap: Forcing redeplou [07:49:21] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1212 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/909601 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [07:50:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1212 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P47055 and previous config saved to /var/cache/conftool/dbconfig/20230418-075032-marostegui.json [07:50:38] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:51:25] claime: sorry I was afk for 10 minutes [07:51:25] !log cgoubert@deploy2002 Finished scap: Forcing redeplou (duration: 02m 31s) [07:51:27] (03PS1) 10Zabe: Add separate config for enabling JsonConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909603 [07:51:29] what's going on? 
[07:52:09] joe: cd -security [07:52:21] yeah that doesn't tell me much [07:53:08] (03CR) 10Majavah: [C: 03+1] Add separate config for enabling JsonConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909603 (owner: 10Zabe) [07:59:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:59:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:00:04] jnuche and ^demon: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T0800) [08:00:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:00:26] I'd delay the train now (see _security) [08:00:47] unless we're clear all is back now. 
[08:01:25] I would like to reenable JsonConfig: https://gerrit.wikimedia.org/r/909603 [08:01:56] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909603 (owner: 10Zabe) [08:02:08] urbanecm: thx for the headsup, will wait for the good to go [08:02:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909603 (owner: 10Zabe) [08:03:52] (03Merged) 10jenkins-bot: Add separate config for enabling JsonConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909603 (owner: 10Zabe) [08:04:44] !log zabe@deploy2002 Started scap: Backport for [[gerrit:909603|Add separate config for enabling JsonConfig]] [08:06:04] !log zabe@deploy2002 zabe: Backport for [[gerrit:909603|Add separate config for enabling JsonConfig]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [08:06:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:08:17] !log repooling wdqs2011 [08:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:43] (03Abandoned) 10Ladsgroup: mariadb: Promote db1122 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/819519 (owner: 10Gerrit maintenance bot) [08:09:48] (03Abandoned) 10Ladsgroup: db1121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/768658 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [08:11:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:11:26] (03PS1) 10Elukey: amd-gpu-tester: add rccl package (ROCm suite) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909604 (https://phabricator.wikimedia.org/T333009) [08:12:28] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:909603|Add separate config for enabling JsonConfig]] (duration: 07m 43s) [08:15:06] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: add rccl package (ROCm suite) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909604 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [08:19:30] (03PS1) 10Clément Goubert: service: add comment for spicerack field addition [puppet] - 10https://gerrit.wikimedia.org/r/909605 [08:24:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:24:43] urbanecm: issue seems resolved now according to _security, will deploy in 5m if there's no objection [08:24:52] no objections [08:30:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:30:31] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909606 (https://phabricator.wikimedia.org/T330211) [08:30:33] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909606 (https://phabricator.wikimedia.org/T330211) (owner: 
10TrainBranchBot) [08:30:54] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:18] (03CR) 10Marostegui: [C: 03+1] "I will merge when ready" [puppet] - 10https://gerrit.wikimedia.org/r/909324 (https://phabricator.wikimedia.org/T334455) (owner: 10Jcrespo) [08:31:29] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909606 (https://phabricator.wikimedia.org/T330211) (owner: 10TrainBranchBot) [08:31:49] (03CR) 10Ladsgroup: [C: 03+1] "Let me know if I can help too" [puppet] - 10https://gerrit.wikimedia.org/r/909324 (https://phabricator.wikimedia.org/T334455) (owner: 10Jcrespo) [08:32:31] (03CR) 10Ladsgroup: [C: 03+1] "FWIW:" [puppet] - 10https://gerrit.wikimedia.org/r/909324 (https://phabricator.wikimedia.org/T334455) (owner: 10Jcrespo) [08:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:33:38] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:33:52] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:30] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:34:46] 
RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:35:10] RECOVERY - Host irc2001 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [08:35:38] PROBLEM - ircecho bot process on irc2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [08:35:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:37:12] RECOVERY - ircecho bot process on irc2001 is OK: PROCS OK: 1 process with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [08:37:19] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-worker1110.eqiad.wmnet with reason: Upgrading RAID controller firmware [08:37:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-worker1110.eqiad.wmnet with reason: Upgrading RAID controller firmware [08:38:10] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.5 refs T330211 [08:38:15] T330211: 1.41.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T330211 [08:38:35] (03CR) 10TheDJ: Add separate config for enabling JsonConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909603 (owner: 10Zabe) [08:39:32] (JobUnavailable) resolved: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:40:20] RECOVERY - Host orespoolcounter2004 
is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [08:40:36] PROBLEM - Check systemd state on orespoolcounter2004 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:38] (ProbeDown) resolved: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:42:03] (03CR) 10Zabe: Add separate config for enabling JsonConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909603 (owner: 10Zabe) [08:42:05] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10Clement_Goubert) [08:42:10] RECOVERY - Check systemd state on orespoolcounter2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:42:30] has anybody rebooted orespoolcounter2004? 
[08:42:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [08:42:36] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (10cmooney) p:05Triage→03Low [08:42:54] (03PS1) 10KartikMistry: Enable Content/Section translation on 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909607 (https://phabricator.wikimedia.org/T327102) [08:43:40] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:44:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [08:45:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:49:53] (03PS2) 10Clément Goubert: admin: kmorgan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/907967 (https://phabricator.wikimedia.org/T334432) (owner: 10BCornwall) [08:50:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:51:26] (03PS1) 10Btullis: Disable the gobblin timers temporarily on the prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/909608 (https://phabricator.wikimedia.org/T333377) [08:52:24] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate 
internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [08:53:10] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40708/console" [puppet] - 10https://gerrit.wikimedia.org/r/909608 (https://phabricator.wikimedia.org/T333377) (owner: 10Btullis) [08:57:55] (03CR) 10MVernon: [C: 03+1] admin: kmorgan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/907967 (https://phabricator.wikimedia.org/T334432) (owner: 10BCornwall) [08:59:23] (03PS1) 10Elukey: amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909609 (https://phabricator.wikimedia.org/T333009) [09:00:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:00:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/907967 (https://phabricator.wikimedia.org/T334432) (owner: 10BCornwall) [09:01:32] (03CR) 10Clément Goubert: [C: 03+2] admin: kmorgan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/907967 (https://phabricator.wikimedia.org/T334432) (owner: 10BCornwall) [09:02:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [09:05:13] (03CR) 10MVernon: "Looks reasonable to me, once the systemd unit is deployed everywhere; do we need some sort of versioned approach, or is the plan to just r" [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [09:05:31] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 
(10phaultfinder) [09:09:10] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10Aklapper) >>! In T334432#8786115, @DMburugu wrote: > Approved The Phab account @DMburugu is linked to a [self-created personal SUL account](https://meta.w... [09:11:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:12:47] (03PS1) 10Marostegui: db1118: Remove candidate master note [puppet] - 10https://gerrit.wikimedia.org/r/909611 [09:13:16] (03CR) 10Marostegui: [C: 03+2] db1118: Remove candidate master note [puppet] - 10https://gerrit.wikimedia.org/r/909611 (owner: 10Marostegui) [09:15:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:17:12] (03CR) 10Elukey: amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909609 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:17:27] (03PS2) 10Majavah: O:wmcs::nfs: remove primary_backup::misc and related classes [puppet] - 10https://gerrit.wikimedia.org/r/906783 (https://phabricator.wikimedia.org/T301280) [09:17:29] (03PS2) 10Majavah: P:toolforge::checker: remove showmount check [puppet] - 10https://gerrit.wikimedia.org/r/907134 [09:17:31] (03PS2) 10Majavah: hieradata: openstack: drop NAT exceptions for nfs-tools-project 
[puppet] - 10https://gerrit.wikimedia.org/r/907135 (https://phabricator.wikimedia.org/T333477) [09:17:33] (03PS1) 10Majavah: O:wmcs::nfs: remove primary_backup::tools and related classes [puppet] - 10https://gerrit.wikimedia.org/r/909612 (https://phabricator.wikimedia.org/T333477) [09:17:35] (03PS1) 10Majavah: labstore: delete backup related classes [puppet] - 10https://gerrit.wikimedia.org/r/909613 [09:17:37] (03PS1) 10Majavah: labstore: stop provisioning backup keys [puppet] - 10https://gerrit.wikimedia.org/r/909614 [09:17:39] (03PS1) 10Majavah: labstore: remove most monitoring classes [puppet] - 10https://gerrit.wikimedia.org/r/909615 [09:18:49] (03CR) 10Elukey: [C: 04-1] amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909609 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:20:05] (03PS1) 10Clément Goubert: Revert "admin: kmorgan to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/909629 [09:20:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:20:19] (03PS1) 10Muehlenhoff: Failover irc.wikimedia.org to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/909616 (https://phabricator.wikimedia.org/T333377) [09:22:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [09:23:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 
(10phaultfinder) [09:31:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:31:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:50] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10Clement_Goubert) >>! In T334432#8788540, @Aklapper wrote: >>>! In T334432#8786115, @DMburugu wrote: >> Approved > > The Phab account @DMburugu is linked to a [self-created pers... [09:33:42] (03CR) 10Ladsgroup: [C: 03+1] Failover irc.wikimedia.org to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/909616 (https://phabricator.wikimedia.org/T333377) (owner: 10Muehlenhoff) [09:35:01] (03Abandoned) 10Clément Goubert: Revert "admin: kmorgan to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/909629 (owner: 10Clément Goubert) [09:36:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:42:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40709/console" [puppet] - 
10https://gerrit.wikimedia.org/r/907939 (owner: 10Majavah) [09:43:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/907939 (owner: 10Majavah) [09:45:35] (03CR) 10Muehlenhoff: [C: 03+2] Failover irc.wikimedia.org to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/909616 (https://phabricator.wikimedia.org/T333377) (owner: 10Muehlenhoff) [09:45:57] (03PS1) 10Btullis: Stop the YARN queues temporarily to facilitate switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/909621 (https://phabricator.wikimedia.org/T333377) [09:45:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:26] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (10ayounsi) Agreed, the workaround sgtm! [09:52:07] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10MoritzMuehlenhoff) [09:52:56] PROBLEM - Check systemd state on backup2002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:32] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10MoritzMuehlenhoff) [09:54:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:56:24] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ArielGlenn) [09:57:14] (03PS4) 
10Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) [09:58:17] (03PS2) 10Elukey: amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909609 (https://phabricator.wikimedia.org/T333009) [09:58:49] (03PS1) 10Majavah: Hide raw Graph tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909623 (https://phabricator.wikimedia.org/T334895) [09:59:32] (03CR) 10Elukey: amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909609 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1000) [10:02:08] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909609 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:03:21] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Many thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [10:03:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1218.eqiad.wmnet with reason: Maintenance [10:03:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1218.eqiad.wmnet with reason: Maintenance [10:03:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1218 (T333332)', diff saved to https://phabricator.wikimedia.org/P47056 and previous config saved to /var/cache/conftool/dbconfig/20230418-100359-ladsgroup.json [10:04:04] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [10:04:12] RECOVERY - Check systemd state on backup2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T333332)', diff saved to https://phabricator.wikimedia.org/P47057 and previous config saved to /var/cache/conftool/dbconfig/20230418-100612-ladsgroup.json [10:10:24] 10SRE, 10Incident Tooling: wikimediastatus.net help popups are unreadable - https://phabricator.wikimedia.org/T327201 (10Peachey88) [10:17:25] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10eoghan) [10:20:20] RECOVERY - MegaRAID on an-worker1110 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:21:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to 
https://phabricator.wikimedia.org/P47058 and previous config saved to /var/cache/conftool/dbconfig/20230418-102119-ladsgroup.json [10:23:51] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10MoritzMuehlenhoff) [10:25:51] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: name=ldap-replica1004.wikimedia.org [10:26:39] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10MoritzMuehlenhoff) [10:27:57] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40710/console" [puppet] - 10https://gerrit.wikimedia.org/r/902753 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:28:33] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add the ceph keys for the osds on the new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/902753 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:30:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:04] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10MoritzMuehlenhoff) [10:31:37] (03PS2) 10FNegri: toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_* [puppet] - 10https://gerrit.wikimedia.org/r/909397 (https://phabricator.wikimedia.org/T334925) (owner: 10BryanDavis) [10:34:48] (03CR) 10Jbond: "lgtm, happy to merge as is but will wait to see your response on the inline comment" [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [10:36:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1218', diff saved to https://phabricator.wikimedia.org/P47059 and previous config saved to /var/cache/conftool/dbconfig/20230418-103625-ladsgroup.json [10:38:13] (03CR) 10Muehlenhoff: O:wmcs::nfs: remove primary_backup::tools and related classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909612 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [10:38:31] (03CR) 10Majavah: ssh: add support for using a CA for host keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [10:38:59] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10jbond) [10:39:09] (03CR) 10Majavah: O:wmcs::nfs: remove primary_backup::tools and related classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909612 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [10:39:45] (03CR) 10Majavah: [V: 03+1] apt::repository: use signed-by instead of apt-key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [10:44:38] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (10Peachey88) [10:48:59] (03CR) 10Muehlenhoff: [C: 03+1] O:wmcs::nfs: remove primary_backup::tools and related classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909612 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [10:49:37] (03CR) 10Jbond: ssh: add support for using a CA for host keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [10:51:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T333332)', diff saved to https://phabricator.wikimedia.org/P47060 and previous config saved to /var/cache/conftool/dbconfig/20230418-105131-ladsgroup.json [10:51:34] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:51:38] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [10:51:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:51:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:52:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:52:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance [10:52:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance [10:52:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [10:53:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [10:53:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T333332)', diff saved to https://phabricator.wikimedia.org/P47061 and previous config saved to /var/cache/conftool/dbconfig/20230418-105308-ladsgroup.json [10:54:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40712/console" [puppet] - 10https://gerrit.wikimedia.org/r/907991 (owner: 10Jbond) [10:55:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T333332)', diff saved to https://phabricator.wikimedia.org/P47062 and previous config saved to /var/cache/conftool/dbconfig/20230418-105523-ladsgroup.json [10:55:57] (03PS3) 10Jbond: puppet::agent: Pass through the 
enable_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/909326 [10:55:59] (03PS10) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 [10:56:02] (03PS12) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [10:56:04] (03PS9) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 [10:56:06] (03PS52) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [10:57:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40713/console" [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [10:57:57] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40714/console" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [10:59:37] (03PS1) 10Jcrespo: mariadb: Move db1102, db1116 to spares [puppet] - 10https://gerrit.wikimedia.org/r/909647 (https://phabricator.wikimedia.org/T334927) [10:59:51] (03PS8) 10Slyngshede: Read systems and approval rules from YAML file. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [11:00:41] (03CR) 10CI reject: [V: 04-1] wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [11:00:55] !log puppet cert clean kafka_jumbo-eqiad_broker on puppetmaster1001 - remove old certificate (not used anymore) [11:00:58] (03PS53) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 [11:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:35] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform Value Stream, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) [11:02:05] jouncebot: nowandnext [11:02:05] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [11:02:05] In 1 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1300) [11:02:05] In 1 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1300) [11:02:12] (03PS3) 10Urbanecm: [Growth] Prepare for a Personalized praise config variable change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908365 (https://phabricator.wikimedia.org/T334630) [11:02:30] (03PS4) 10Majavah: ssh: add support for using a CA for host keys [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) [11:02:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908365 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [11:03:08] (03CR) 10Jbond: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [11:03:27] (03Merged) 
10jenkins-bot: [Growth] Prepare for a Personalized praise config variable change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908365 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [11:03:49] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:908365|[Growth] Prepare for a Personalized praise config variable change (T334630)]] [11:03:54] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [11:04:30] (03CR) 10Majavah: ssh: add support for using a CA for host keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [11:09:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Hide raw Graph tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909623 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [11:09:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40716/console" [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [11:10:10] (03CR) 10Jcrespo: [C: 03+2] mariadb: Move db1102, db1116 to spares [puppet] - 10https://gerrit.wikimedia.org/r/909647 (https://phabricator.wikimedia.org/T334927) (owner: 10Jcrespo) [11:10:27] (03CR) 10Jbond: [V: 03+1 C: 03+2] "LGTM will merge, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [11:10:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P47063 and previous config saved to /var/cache/conftool/dbconfig/20230418-111029-ladsgroup.json [11:10:32] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:908365|[Growth] Prepare for a Personalized praise config variable change (T334630)]] (duration: 06m 43s) [11:10:39] T334630: 
Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [11:11:14] taavi: fyi the ssh ca cr is merged [11:11:20] thanks [11:12:51] (03CR) 10Novem Linguae: [C: 03+1] wikireplicas: drop views for pagetriage_log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884454 (https://phabricator.wikimedia.org/T325519) (owner: 10Majavah) [11:13:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909623 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [11:13:29] (03PS2) 10Majavah: Hide raw Graph tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909623 (https://phabricator.wikimedia.org/T334895) [11:13:42] (03CR) 10TrainBranchBot: "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909623 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [11:14:32] (03Merged) 10jenkins-bot: Hide raw Graph tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909623 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [11:14:39] jbond: thanks! I guess the next step would be to generate a CA key for some WMCS project with a local puppetmaster and enable key signing and see what happens? 
[11:14:54] !log taavi@deploy2002 Started scap: Backport for [[gerrit:909623|Hide raw Graph tags (T334895)]] [11:16:15] !log taavi@deploy2002 taavi: Backport for [[gerrit:909623|Hide raw Graph tags (T334895)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [11:16:21] !log jynus@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1116.eqiad.wmnet [11:16:32] (03PS5) 10Aklapper: Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748) [11:20:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:21:30] (03PS1) 10Majavah: Add some fake SSH CA keys [labs/private] - 10https://gerrit.wikimedia.org/r/909648 (https://phabricator.wikimedia.org/T268344) [11:22:03] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:909623|Hide raw Graph tags (T334895)]] (duration: 07m 09s) [11:22:16] !log jynus@cumin1001 START - Cookbook sre.dns.netbox [11:22:17] taavi: yes sgtm, [11:22:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:22:41] for production we would use pki.discovery.wmnet but i think just generating a local ca would be fine for testing [11:23:19] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for KBach - https://phabricator.wikimedia.org/T334931 (10KBach) [11:24:09] (03PS1) 10Majavah: P:ssh::ca: do not call secret() with undef [puppet] - 
10https://gerrit.wikimedia.org/r/909649 [11:24:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] Add some fake SSH CA keys [labs/private] - 10https://gerrit.wikimedia.org/r/909648 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [11:24:31] !log jynus@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1116.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1001" [11:24:33] (03CR) 10CI reject: [V: 04-1] P:ssh::ca: do not call secret() with undef [puppet] - 10https://gerrit.wikimedia.org/r/909649 (owner: 10Majavah) [11:24:34] can cfssl use ssh certificates? [11:24:51] (03PS2) 10Majavah: P:ssh::ca: do not call secret() with undef [puppet] - 10https://gerrit.wikimedia.org/r/909649 [11:25:20] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10DMburugu) Approved [11:25:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P47064 and previous config saved to /var/cache/conftool/dbconfig/20230418-112536-ladsgroup.json [11:26:24] (03CR) 10Jbond: P:ssh::ca: do not call secret() with undef (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909649 (owner: 10Majavah) [11:26:47] (03PS3) 10Majavah: P:ssh::ca: do not call secret() with undef [puppet] - 10https://gerrit.wikimedia.org/r/909649 [11:27:25] (03CR) 10Jbond: [C: 03+2] "cheers" [puppet] - 10https://gerrit.wikimedia.org/r/909649 (owner: 10Majavah) [11:27:33] !log jynus@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1116.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1001" [11:27:33] !log jynus@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:27:34] !log jynus@cumin1001 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts db1116.eqiad.wmnet [11:27:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:27:41] (03CR) 10Marostegui: "This would still need a manual run from Data Engineering folks on each replica." [puppet] - 10https://gerrit.wikimedia.org/r/884454 (https://phabricator.wikimedia.org/T325519) (owner: 10Majavah) [11:27:59] (03CR) 10Marostegui: [C: 03+1] wikireplicas: drop views for pagetriage_log [puppet] - 10https://gerrit.wikimedia.org/r/884454 (https://phabricator.wikimedia.org/T325519) (owner: 10Majavah) [11:28:01] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:ssh::ca: do not call secret() with undef [puppet] - 10https://gerrit.wikimedia.org/r/909649 (owner: 10Majavah) [11:28:43] 10SRE, 10SRE-swift-storage: Memory exhaustion when uploading large TIFF files by URL - https://phabricator.wikimedia.org/T334814 (10Don-vip) Is it nominal that I can't send chunks larger than 16Kb ? If I send chunks 32 Kb or above mediawiki seems to read the first chunk as an entire tiff file and fails with er... 
[11:30:21] taavi: re cfssl, I'm not sure, I'd need to check [11:30:59] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1110.eqiad.wmnet [11:32:43] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1110.eqiad.wmnet [11:33:40] (03CR) 10Btullis: [V: 03+1 C: 03+2] Disable the gobblin timers temporarily on the prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/909608 (https://phabricator.wikimedia.org/T333377) (owner: 10Btullis) [11:34:53] !log jynus@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1102.eqiad.wmnet [11:36:31] (03CR) 10Btullis: [C: 03+2] Stop the YARN queues temporarily to facilitate switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/909621 (https://phabricator.wikimedia.org/T333377) (owner: 10Btullis) [11:36:42] (03PS9) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [11:38:29] (03PS1) 10Ssingh: hiera: temporarily removed dns1002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/909653 (https://phabricator.wikimedia.org/T333377) [11:39:58] !log jynus@cumin1001 START - Cookbook sre.dns.netbox [11:40:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T333332)', diff saved to https://phabricator.wikimedia.org/P47065 and previous config saved to /var/cache/conftool/dbconfig/20230418-114042-ladsgroup.json [11:40:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [11:40:48] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [11:40:50] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10Aklapper) **[off-topic]** @DMburugu: Hi, if you are WMF staff and use the Phabricator account 
@DMburugu in staff capacity, could you please connect your [SUL staff account](http... [11:41:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [11:41:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T333332)', diff saved to https://phabricator.wikimedia.org/P47066 and previous config saved to /var/cache/conftool/dbconfig/20230418-114106-ladsgroup.json [11:43:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T333332)', diff saved to https://phabricator.wikimedia.org/P47067 and previous config saved to /var/cache/conftool/dbconfig/20230418-114320-ladsgroup.json [11:44:10] (03PS1) 10Majavah: ssh: add missing -h to ssh key sign command [puppet] - 10https://gerrit.wikimedia.org/r/909655 [11:46:11] (03PS3) 10Urbanecm: [Growth] Finish Personalized praise variable rename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630) [11:46:48] !log jynus@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1102.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1001" [11:48:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_netflow.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:46] !log depooling eqiad due to eqiad row D switches upgrade - T333377 [11:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:51] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [11:49:07] !log jynus@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1102.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - 
jynus@cumin1001" [11:49:07] !log jynus@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:49:08] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1102.eqiad.wmnet [11:50:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:50:55] !log jiji@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 [11:51:16] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade... [11:54:21] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) >>! In T330495#8787624, @jhathaway wrote: > We can either leave it or add one of two patches on top of the foward port. > > # Use YAML.unsafe_load_file, which sho... 
[11:54:23] (03PS1) 10Jcrespo: mariadb: Remove final references to db1116 & db1102 [puppet] - 10https://gerrit.wikimedia.org/r/909656 (https://phabricator.wikimedia.org/T334927) [11:54:28] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10BTullis) [11:54:39] (Nonwrite HTTP requests with primary DB connections alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [11:56:44] (03CR) 10Jcrespo: [C: 03+1] "CCing also Volans to ping him that his (great and very welcome) improvements on the decom script made some instructions on the decom phab " [puppet] - 10https://gerrit.wikimedia.org/r/909656 (https://phabricator.wikimedia.org/T334927) (owner: 10Jcrespo) [11:57:00] (03CR) 10Jcrespo: [C: 03+2] mariadb: Remove final references to db1116 & db1102 [puppet] - 10https://gerrit.wikimedia.org/r/909656 (https://phabricator.wikimedia.org/T334927) (owner: 10Jcrespo) [11:57:10] (03PS2) 10Jcrespo: mariadb: Remove final references to db1116 & db1102 [puppet] - 10https://gerrit.wikimedia.org/r/909656 (https://phabricator.wikimedia.org/T334927) [11:57:15] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1018.eqiad.wmnet [11:57:27] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase102[5-7].eqiad.wmnet [11:57:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase103[03].eqiad.wmnet [11:58:14] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10BTullis) [11:58:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/900277 (owner: 10Slyngshede) [11:58:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to 
https://phabricator.wikimedia.org/P47068 and previous config saved to /var/cache/conftool/dbconfig/20230418-115827-ladsgroup.json [11:58:48] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10hnowlan) [12:04:42] (03CR) 10Slyngshede: "One unresolved comment, but I think we need to talk about how we want that feature to work. It's a good idea though." [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 (owner: 10Slyngshede) [12:04:56] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10Clement_Goubert) 05Stalled→03Resolved Out of band verification confirmed, marking resolved. [12:12:34] (03PS1) 10Slyngshede: C:idm::deployment enable ldap property editor. [puppet] - 10https://gerrit.wikimedia.org/r/909658 [12:12:55] (03CR) 10Slyngshede: [C: 03+2] Password reset - Allow users to request a password reset. [software/bitu] - 10https://gerrit.wikimedia.org/r/900277 (owner: 10Slyngshede) [12:12:57] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Password reset - Allow users to request a password reset. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/900277 (owner: 10Slyngshede) [12:13:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P47069 and previous config saved to /var/cache/conftool/dbconfig/20230418-121333-ladsgroup.json [12:14:39] (Nonwrite HTTP requests with primary DB connections alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [12:19:05] BGP alerts in eqiad expected as we have stopped bird on dns, doh and durum [12:19:49] (03CR) 10Ssingh: [C: 03+2] hiera: temporarily removed dns1002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/909653 (https://phabricator.wikimedia.org/T333377) (owner: 10Ssingh) [12:21:30] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ssingh) [12:22:00] PROBLEM - Bird Internet Routing Daemon on dns1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:22:14] ^ expected [12:22:28] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 5 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:22:48] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 5 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:22:54] PROBLEM - Bird Internet Routing Daemon on durum1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:24:00] PROBLEM - Bird Internet Routing Daemon on doh1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:24:09] ^ all expected [12:25:39] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - 
https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade... [12:26:18] (03CR) 10Jbond: [C: 03+2] ssh: add missing -h to ssh key sign command [puppet] - 10https://gerrit.wikimedia.org/r/909655 (owner: 10Majavah) [12:26:32] !log jiji@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 [12:26:38] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [12:27:14] !log jiji@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 [12:27:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 [12:27:32] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade... [12:27:48] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade... 
[12:28:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T333332)', diff saved to https://phabricator.wikimedia.org/P47070 and previous config saved to /var/cache/conftool/dbconfig/20230418-122839-ladsgroup.json
[12:28:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance
[12:28:46] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[12:28:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance
[12:29:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T333332)', diff saved to https://phabricator.wikimedia.org/P47071 and previous config saved to /var/cache/conftool/dbconfig/20230418-122903-ladsgroup.json
[12:29:30] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CR
[12:29:30] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:32:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T333332)', diff saved to https://phabricator.wikimedia.org/P47072 and previous config saved to /var/cache/conftool/dbconfig/20230418-123218-ladsgroup.json
[12:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:33:58] (PS1) Ssingh: depool eqiad [dns] - https://gerrit.wikimedia.org/r/909662 (https://phabricator.wikimedia.org/T333377)
[12:34:16] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:34:29] (PS2) Ssingh: depool eqiad [dns] - https://gerrit.wikimedia.org/r/909662 (https://phabricator.wikimedia.org/T333377)
[12:36:17] !log jiji@cumin1001 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[12:36:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[12:38:28] SRE, Phabricator: Remove phabricator Multi-factor Auth for Atieno - https://phabricator.wikimedia.org/T334480 (Clement_Goubert) Identity confirmed and 2FA reset.
[12:39:48] !log imported puppet 5.5.22-2+deb12u2 for bookworm-wikimedia T330495
[12:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:54] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495
[12:39:57] (CR) Cathal Mooney: [C: +1] depool eqiad [dns] - https://gerrit.wikimedia.org/r/909662 (https://phabricator.wikimedia.org/T333377) (owner: Ssingh)
[12:40:15] SRE, SRE-Access-Requests, API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (Clement_Goubert)
[12:40:19] SRE, Phabricator: Remove phabricator Multi-factor Auth for Atieno - https://phabricator.wikimedia.org/T334480 (Clement_Goubert) Open→Resolved
[12:40:31] (CR) Ssingh: [C: +2] depool eqiad [dns] - https://gerrit.wikimedia.org/r/909662 (https://phabricator.wikimedia.org/T333377) (owner: Ssingh)
[12:40:52] !log run authdns-update to depool eqiad for switch upgrade
[12:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:47] (PS1) Muehlenhoff: Update puppet version to be installed on bookworm [puppet] - https://gerrit.wikimedia.org/r/909663 (https://phabricator.wikimedia.org/T330495)
[12:45:31] SRE, SRE-Access-Requests, API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (Atieno) MFA resolved...and I have signed the L3 Acknowledgement {F36955449}
[12:45:44] SRE, SRE-Access-Requests, API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (Atieno)
[12:46:40] (PS1) Jbond: sso: add puppet-dev cert [puppet] - https://gerrit.wikimedia.org/r/909664
[12:47:15] (CR) Jbond: [C: +2] sso: add puppet-dev cert [puppet] - https://gerrit.wikimedia.org/r/909664 (owner: Jbond)
[12:47:17] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[12:47:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P47073 and previous config saved to /var/cache/conftool/dbconfig/20230418-124724-ladsgroup.json
[12:47:31] (CR) Jbond: [V: +2 C: +2] sso: add puppet-dev cert [puppet] - https://gerrit.wikimedia.org/r/909664 (owner: Jbond)
[12:49:16] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[12:53:05] (CR) Muehlenhoff: puppet::agent: Pass through the enable_puppet7 flag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909326 (owner: Jbond)
[12:54:55] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Papaul) a:Cmjohnson→Jgreen @Jgreen All firmware up to date on servers. All yours
[12:55:10] SRE, ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul)
[12:56:11] SRE, SRE-Access-Requests, API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (Clement_Goubert) Stalled→In progress
[13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1300). Please do the needful.
[13:00:07] No Gerrit patches in the queue for this window AFAICS.
[13:00:07] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1300)
[13:00:07] xSavitar and raynor: A patch you scheduled for Mobileapps/RESTBase/Wikifeeds is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] o/
[13:00:22] PROBLEM - PHP opcache health on mw2428 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[13:00:29] nothing to deploy in the backport window indeed
[13:00:29] o/
[13:02:19] !log derick@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply
[13:02:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P47074 and previous config saved to /var/cache/conftool/dbconfig/20230418-130231-ladsgroup.json
[13:03:42] !log derick@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply
[13:03:49] (CR) Pmiazga: [C: +2] proton: Deploy latest proton image `2023-04-17-085853-production` [deployment-charts] - https://gerrit.wikimedia.org/r/909187 (https://phabricator.wikimedia.org/T334825) (owner: D3r1ck01)
[13:04:53] (CR) Jbond: puppet::agent: Pass through the enable_puppet7 flag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909326 (owner: Jbond)
[13:04:57] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 270 hosts with reason: eqiad row D upgrade
[13:06:10] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T334901 (Papaul) @jcrespo looks like this server is back up again with the interface error so after doing all we had to do on our end i think there is nothing more to do here but to move the server to a 10G connection sine you mention t...
[13:06:25] !log upload libapache2-mod-auth-cas_1.2-1+wmf12u1
[13:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:50] !log disabling ping offload on cr1-eqiad and cr2-eqiad in advance of row D switch upgrade T333377
[13:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:57] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[13:07:17] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[13:07:33] (PS3) Hokwelum: make dumpsdata1006 the xmlfallback host [puppet] - https://gerrit.wikimedia.org/r/908995 (https://phabricator.wikimedia.org/T325232)
[13:08:51] SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Papaul)
[13:09:49] (Merged) jenkins-bot: proton: Deploy latest proton image `2023-04-17-085853-production` [deployment-charts] - https://gerrit.wikimedia.org/r/909187 (https://phabricator.wikimedia.org/T334825) (owner: D3r1ck01)
[13:10:17] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder)
[13:10:56] !log derick@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply
[13:11:10] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 270 hosts with reason: eqiad row D upgrade
[13:11:30] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7fc7ae6f-d3b2-43ed-b030-194ed6367c80) set by cmooney@cumin1001 for 2:00:00 on 270 host(s...
[13:12:11] !log disable puppet fleet wide T333377
[13:12:14] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (cmooney)
[13:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:18] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[13:12:21] !log derick@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply
[13:13:08] !log derick@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply
[13:14:26] !log derick@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[13:15:00] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on asw2-d-eqiad with reason: eqiad row D upgrade
[13:15:16] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on asw2-d-eqiad with reason: eqiad row D upgrade
[13:15:36] !log derick@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply
[13:16:51] !log derick@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[13:17:07] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e714b564-285e-4f22-b860-267d7c23208d) set by cmooney@cumin1001 for 2:00:00 on 1 host(s)...
[13:17:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T333332)', diff saved to https://phabricator.wikimedia.org/P47075 and previous config saved to /var/cache/conftool/dbconfig/20230418-131738-ladsgroup.json
[13:17:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:17:49] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[13:17:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:18:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance
[13:18:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance
[13:18:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T333332)', diff saved to https://phabricator.wikimedia.org/P47076 and previous config saved to /var/cache/conftool/dbconfig/20230418-131827-ladsgroup.json
[13:20:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T333332)', diff saved to https://phabricator.wikimedia.org/P47078 and previous config saved to /var/cache/conftool/dbconfig/20230418-132042-ladsgroup.json
[13:21:33] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/908326 (owner: Jbond)
[13:21:46] raynor and I are done here.
[13:21:47] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (klausman)
[13:22:36] !log RESTBase/Proton deployment complete
[13:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:59] jouncebot: nowandnext
[13:22:59] For the next 0 hour(s) and 37 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1300)
[13:22:59] For the next 0 hour(s) and 37 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1300)
[13:23:00] In 2 hour(s) and 37 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1600)
[13:25:08] !log Rebooting asw2-d-eqiad virtual-chassis (all row D top-of-rack switches) to upgrade JunOS. Row D going down T333377
[13:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:15] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[13:27:47] ok time to update the upgs with all the stuff in flight
[13:29:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:29:44] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:29:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 202, down: 5, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:30:10] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:30:12] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 12 down 4: https://wikitech.wikimedia.org/wiki/HAProxy
[13:30:24] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 216, down: 6, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:30:28] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[13:30:43] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:31:16] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:31:21] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:31:26] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:31:30] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1349.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1380.eqiad.wmnet, mw1381.eqiad.wmnet, mw1443.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:32:26] PROBLEM - configured eth on lvs1019 is CRITICAL: ens1f1np1 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[13:32:32] (JobUnavailable) firing: (5) Reduced availability for job chartmuseum in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:32:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:32:46] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 12 down 4: https://wikitech.wikimedia.org/wiki/HAProxy
[13:32:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[13:32:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[13:34:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:35:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P47079 and previous config saved to /var/cache/conftool/dbconfig/20230418-133549-ladsgroup.json
[13:36:16] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:36:30] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:36:54] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:37:17] (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[13:37:32] (JobUnavailable) firing: (7) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:37:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:37:46] PROBLEM - Host 2620:0:861:4:208:80:155:108 is DOWN: PING CRITICAL - Packet loss = 100%
[13:37:53] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[13:37:55] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[13:38:02] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service,ferm.service,prometheus-nic-firmware-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:09] (EtcdReplicationDown) firing: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
[13:38:24] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:38:26] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:38:30] ACKing the page
[13:38:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:38:44] sukhe: this is due to maint work ?
[13:38:44] 09:38:09 <+jinxer-wm> (EtcdReplicationDown) firing: etcd replication down on conf2005:8000 #page -
[13:38:46] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:38:55] effie: not sure but I am also attributing it to that for now!
[13:38:59] lol
[13:39:00] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:39:04] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[13:39:05] CRITICAL: Generic error: Connection to etcd failed due to MaxRetryError("HTTPSConnectionPool(host='conf1009.eqiad.wmnet', port=4001):
[13:39:06] PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:08] sukhe: --^
[13:39:12] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:20] MaxRetryError("HTTPSConnectionPool(host='conf1009.eqiad.wmnet', port=4001):
[13:39:25] thx
[13:39:28] RECOVERY - Host 2620:0:861:4:208:80:155:108 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms
[13:39:29] thanks elukey, so it is that
[13:40:48] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[13:40:54] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:16] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:41:16] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:41:21] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:41:43] !log restart etcdmirror on conf2005 (down due to conf1009 under maintenance)
[13:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:41:54] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2005 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:41:58] elukey: <3
[13:42:17] (KafkaUnderReplicatedPartitions) resolved: (3) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[13:42:17] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) dbproxy[1016-1017] reloaded
[13:42:27] (PS1) Ssingh: Revert "depool eqiad" [dns] - https://gerrit.wikimedia.org/r/909635
[13:42:31] (CR) Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[13:42:32] RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:42:32] (JobUnavailable) resolved: (8) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:43:03] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[13:43:08] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[13:43:13] (EtcdReplicationDown) resolved: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
[13:43:44] (CR) Jforrester: [C: +1] Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748) (owner: Aklapper)
[13:44:37] SRE, Infrastructure-Foundations, cloud-services-team: Need hp-health package on Bullseye - https://phabricator.wikimedia.org/T300438 (MoritzMuehlenhoff) Open→Declined We only have five remaining HP servers (two of them up for immediate decom) and we've already skipped hp-health for Bullseye a...
[13:44:52] (PS1) Btullis: Revert "Stop the YARN queues temporarily to facilitate switch maintenance" [puppet] - https://gerrit.wikimedia.org/r/909636
[13:46:16] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:46:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:47:57] !re-enabling ping offload on eqiad CR routers T333377
[13:47:58] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[13:48:16] (CR) Btullis: [C: +2] Revert "Stop the YARN queues temporarily to facilitate switch maintenance" [puppet] - https://gerrit.wikimedia.org/r/909636 (owner: Btullis)
[13:49:35] SRE, SRE-swift-storage: Memory exhaustion when uploading large TIFF files by URL - https://phabricator.wikimedia.org/T334814 (Don-vip) I managed to send 5 Mb chunks if the first one is no larger than 16 Kb. But when I send the last chunk I face the same error as I did when uploading by URL: ` 15:34:29....
[13:49:53] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for 270 hosts
[13:49:53] (CR) Muehlenhoff: puppet::agent: Pass through the enable_puppet7 flag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909326 (owner: Jbond)
[13:50:18] RECOVERY - Bird Internet Routing Daemon on doh1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:50:33] (CR) Muehlenhoff: [C: +2] Update puppet version to be installed on bookworm [puppet] - https://gerrit.wikimedia.org/r/909663 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[13:50:42] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (klausman)
[13:50:42] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:50:46] RECOVERY - Bird Internet Routing Daemon on durum1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:50:52] (PS1) Ssingh: Revert "hiera: temporarily removed dns1002 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/909637
[13:50:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P47080 and previous config saved to /var/cache/conftool/dbconfig/20230418-135056-ladsgroup.json
[13:50:58] RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:51:28] RECOVERY - Bird Internet Routing Daemon on dns1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:51:29] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ldap-replica1004.wikimedia.org
[13:52:46] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (MoritzMuehlenhoff)
[13:52:56] PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:52:56] (CR) Ssingh: [C: +2] Revert "hiera: temporarily removed dns1002 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/909637 (owner: Ssingh)
[13:52:58] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 270 hosts
[13:53:28] (PS1) Btullis: Revert "Disable the gobblin timers temporarily on the prod cluster" [puppet] - https://gerrit.wikimedia.org/r/909638
[13:54:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:54:22] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:55:05] SRE, Infrastructure-Foundations, vm-requests: eqiad: 1 VM request for Product Analytics Airflow - https://phabricator.wikimedia.org/T334836 (Stevemunene) a:Stevemunene
[13:55:50] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for asw2-d-eqiad
[13:55:51] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for asw2-d-eqiad
[13:56:19] (CR) Btullis: [C: +2] Revert "Disable the gobblin timers temporarily on the prod cluster" [puppet] - https://gerrit.wikimedia.org/r/909638 (owner: Btullis)
[13:57:19] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1110.eqiad.wmnet
[13:57:48] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1110.eqiad.wmnet
[13:58:08] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:58:40] SRE, Infrastructure-Foundations, Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (MoritzMuehlenhoff) >>! In T330495#8789136, @MoritzMuehlenhoff wrote: >> Both patches are attached and were tested on puppetdb1003. I think the unsafe load is proba...
[14:00:31] (Nonwrite HTTP requests with primary DB connections alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert
[14:01:25] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (cmooney)
[14:02:48] PROBLEM - Check whether ferm is active by checking the default input chain on sretest1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:03:02] RECOVERY - configured eth on lvs1019 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:03:53] (CR) Ssingh: [C: +2] Revert "depool eqiad" [dns] - https://gerrit.wikimedia.org/r/909635 (owner: Ssingh)
[14:04:33] !log running authdns-update to repool eqiad after switch maint: T333377
[14:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:40] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[14:05:33] SRE-OnFire, Traffic, conftool, serviceops, Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (MatthewVernon)
[14:06:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T333332)', diff saved to https://phabricator.wikimedia.org/P47081 and previous config saved to /var/cache/conftool/dbconfig/20230418-140602-ladsgroup.json
[14:06:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance
[14:06:09] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[14:06:17] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase103[03].eqiad.wmnet
[14:06:18] (PS1) Clément Goubert: admin: Add atieno to to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/909673 (https://phabricator.wikimedia.org/T333550)
[14:06:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance
[14:06:23] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase102[5-7].eqiad.wmnet
[14:06:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T333332)', diff saved to https://phabricator.wikimedia.org/P47082 and previous config saved to /var/cache/conftool/dbconfig/20230418-140626-ladsgroup.json
[14:06:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1018.eqiad.wmnet
[14:08:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T333332)', diff saved to https://phabricator.wikimedia.org/P47083 and previous config saved to /var/cache/conftool/dbconfig/20230418-140840-ladsgroup.json
[14:10:31] (Nonwrite HTTP requests with primary DB connections alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert
[14:10:41] SRE-OnFire, Observability-Alerting, Sustainability (Incident Followup): api_appserver Average latency exceeded alert fired late when latency was declining again - https://phabricator.wikimedia.org/T334949 (MatthewVernon)
[14:11:23] (CR) Muehlenhoff: SSH Keymanagement, allow user to manage ssh keys. (4 comments) [software/bitu] - https://gerrit.wikimedia.org/r/899519 (owner: Slyngshede)
[14:17:15] (CR) Clément Goubert: [C: +2] Add itamar to restricted group [puppet] - https://gerrit.wikimedia.org/r/898675 (https://phabricator.wikimedia.org/T331899) (owner: Muehlenhoff)
[14:17:53] SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (Clement_Goubert) Approval verified
[14:17:55] SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (Clement_Goubert)
[14:19:33] SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (Clement_Goubert) Stalled→Resolved I have merged your access request @ItamarWMDE, your access should be functional in the next half hour...
[14:21:23] (CR) Pmiazga: [C: +1] "Thanks for the update.
Looks good - can be merged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/909183 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [14:22:10] (03PS1) 10Vgutierrez: hiera: Use a single UDS for haproxy<-->varnish traffic [puppet] - 10https://gerrit.wikimedia.org/r/909675 (https://phabricator.wikimedia.org/T333965) [14:23:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P47084 and previous config saved to /var/cache/conftool/dbconfig/20230418-142346-ladsgroup.json [14:25:21] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40717/console" [puppet] - 10https://gerrit.wikimedia.org/r/909675 (https://phabricator.wikimedia.org/T333965) (owner: 10Vgutierrez) [14:25:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: cloudvirtlocal1001.eqiad.wmnet tends to get stuck on boot - https://phabricator.wikimedia.org/T334696 (10Andrew) Any update on this? 
[14:26:03] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Use a single UDS for haproxy<-->varnish traffic [puppet] - 10https://gerrit.wikimedia.org/r/909675 (https://phabricator.wikimedia.org/T333965) (owner: 10Vgutierrez) [14:27:14] (03PS3) 10Hnowlan: rest-gateway: Extend proton timeout to 150s [deployment-charts] - 10https://gerrit.wikimedia.org/r/909183 (https://phabricator.wikimedia.org/T334611) [14:27:48] 10SRE, 10Traffic, 10Patch-For-Review: Check if it still makes sense to have 8 varnish sockets being used by HAProxy - https://phabricator.wikimedia.org/T333965 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [14:32:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10RobH) [14:34:47] !log upload python3-pypuppetdb_3.3.3-1_all for bookworm [14:38:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P47085 and previous config saved to /var/cache/conftool/dbconfig/20230418-143852-ladsgroup.json [14:39:05] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for KBach - https://phabricator.wikimedia.org/T334931 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium a:03Clement_Goubert [14:45:20] (03PS1) 10Clément Goubert: admin: add kbach to ldap_only_users (for wmf) [puppet] - 10https://gerrit.wikimedia.org/r/909681 (https://phabricator.wikimedia.org/T334931) [14:46:44] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07), 10Patch-For-Review: Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Clement_Goubert) SSH key confirmed out of band. @Atieno As per policy, I also have to bring to your attention the [[ https... 
[14:47:16] (03PS1) 10JMeybohm: k8s: Rename kubernetes_cluster_groups to kubernetes_clusters [puppet] - 10https://gerrit.wikimedia.org/r/909686 (https://phabricator.wikimedia.org/T325268) [14:47:18] (03PS1) 10JMeybohm: WIP: Make kubernetes_clusters the default place for config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) [14:47:28] (03CR) 10CI reject: [V: 04-1] k8s: Rename kubernetes_cluster_groups to kubernetes_clusters [puppet] - 10https://gerrit.wikimedia.org/r/909686 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:47:35] (03CR) 10CI reject: [V: 04-1] WIP: Make kubernetes_clusters the default place for config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:49:13] (03PS2) 10JMeybohm: k8s: Rename kubernetes_cluster_groups to kubernetes_clusters [puppet] - 10https://gerrit.wikimedia.org/r/909686 (https://phabricator.wikimedia.org/T325268) [14:49:15] (03PS2) 10JMeybohm: WIP: Make kubernetes_clusters the default place for config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) [14:49:17] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: Extend proton timeout to 150s [deployment-charts] - 10https://gerrit.wikimedia.org/r/909183 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [14:50:53] (03PS1) 10Vgutierrez: hiera: Use one socket on haproxy<-->varnish@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/909688 (https://phabricator.wikimedia.org/T333965) [14:51:57] (03CR) 10Vgutierrez: [C: 03+1] hiera: lvs/balancer: unify hiera post bullseye upgrade (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/908909 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:52:53] (03CR) 10Herron: [C: 03+1] "much appreciated, this is a big improvement over the process used before" [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 
10Cwhite) [14:52:55] (03CR) 10Vgutierrez: [C: 03+2] hiera: Use one socket on haproxy<-->varnish@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/909688 (https://phabricator.wikimedia.org/T333965) (owner: 10Vgutierrez) [14:53:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T333332)', diff saved to https://phabricator.wikimedia.org/P47086 and previous config saved to /var/cache/conftool/dbconfig/20230418-145359-ladsgroup.json [14:54:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [14:54:04] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10ayounsi) [14:54:05] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:54:15] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) [14:54:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [14:54:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T333332)', diff saved to https://phabricator.wikimedia.org/P47087 and previous config saved to /var/cache/conftool/dbconfig/20230418-145422-ladsgroup.json [14:54:28] (03Merged) 10jenkins-bot: rest-gateway: Extend proton timeout to 150s [deployment-charts] - 10https://gerrit.wikimedia.org/r/909183 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [14:56:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T333332)', diff saved to https://phabricator.wikimedia.org/P47088 and previous config saved to /var/cache/conftool/dbconfig/20230418-145637-ladsgroup.json [14:57:56] (03PS2) 
10KartikMistry: Enable Content/Section translation on 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909607 (https://phabricator.wikimedia.org/T327102) [15:00:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/909681 (https://phabricator.wikimedia.org/T334931) (owner: 10Clément Goubert) [15:01:15] (03CR) 10Clément Goubert: [C: 03+2] admin: add kbach to ldap_only_users (for wmf) [puppet] - 10https://gerrit.wikimedia.org/r/909681 (https://phabricator.wikimedia.org/T334931) (owner: 10Clément Goubert) [15:04:01] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T334901 (10jcrespo) No issue here. Please note those errors don't get transmitted up into the stack, and tcp fixes whatever errors you are seeing (data is hashed at app layer and we see no issue or slowdown). So this is not a current conce... [15:05:03] (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/908884 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [15:06:39] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for KBach - https://phabricator.wikimedia.org/T334931 (10Clement_Goubert) 05In progress→03Resolved Access granted, and added to #wmf-nda Feel free to reopen if there are any issue [15:07:04] !log repooling all eqiad active active services post T333377 [15:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:09] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [15:07:56] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 [15:08:11] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in 
eqiad: End of maintenance - T3333... [15:11:38] (03PS1) 10Lucas Werkmeister (WMDE): Add placeholder content for Graph being offline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) [15:11:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P47089 and previous config saved to /var/cache/conftool/dbconfig/20230418-151143-ladsgroup.json [15:13:27] (03CR) 10Eevans: Do not de-init node prior to restart (031 comment) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [15:14:33] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10BTullis) [15:18:42] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40719/console" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:19:06] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Jhancock.wm) 05Open→03Resolved no new log entries. closing. [15:19:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Clement_Goubert) Acknowledged, thank you for checking :) [15:22:25] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10AndrewTavis_WMDE) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tas... 
[15:22:52] (03PS1) 10FNegri: ToolsDB: remove replication filters [puppet] - 10https://gerrit.wikimedia.org/r/909695 (https://phabricator.wikimedia.org/T328691) [15:23:50] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10karapayneWMDE) As the EM of wikidata at WMDE, I approve this request [15:24:30] (03PS3) 10JMeybohm: WIP: Make kubernetes_clusters the default place for config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) [15:26:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P47090 and previous config saved to /var/cache/conftool/dbconfig/20230418-152649-ladsgroup.json [15:28:14] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10Clement_Goubert) p:05Triage→03Medium [15:29:42] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10AndrewTavis_WMDE) Thank you, @Clement_Goubert! [15:32:26] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10Clement_Goubert) a:03Clement_Goubert Thanks @karapayneWMDE approval noted. @KFrancis can you handle the NDA signing please ? [15:33:35] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07), 10Patch-For-Review: Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Clement_Goubert) a:05Atieno→03Clement_Goubert [15:33:59] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10jhathaway) >>! 
In T330495#8789702, @MoritzMuehlenhoff wrote: > I have pushed the new build as 5.5.22-2+deb12u2 to the component. tested on puppetdb1003, works great! [15:34:26] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10Clement_Goubert) 05Open→03In progress [15:37:07] !log disable puppet in A:lvs and A:codfw to test CR 908909 [15:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:15] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Jhancock.wm) First look shows mgmt port is down. tried new cable in same port, new cable in new port, and old cable in old port. Thinking and coming back to this soon. [15:38:29] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377 [15:38:34] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 [15:38:35] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [15:38:41] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... 
[15:38:57] (03CR) 10Ssingh: [C: 03+2] hiera: lvs/balancer: unify hiera post bullseye upgrade (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/908909 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:39:13] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... [15:41:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T333332)', diff saved to https://phabricator.wikimedia.org/P47091 and previous config saved to /var/cache/conftool/dbconfig/20230418-154156-ladsgroup.json [15:41:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:42:02] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:42:11] (03PS2) 10Lucas Werkmeister (WMDE): Add placeholder content for Graph being disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) [15:42:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:42:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T333332)', diff saved to https://phabricator.wikimedia.org/P47092 and previous config saved to /var/cache/conftool/dbconfig/20230418-154219-ladsgroup.json [15:44:52] (03CR) 10JHathaway: [C: 03+1] "looks reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/907991 (owner: 10Jbond) [15:45:37] !log enable puppet in A:lvs and A:codfw to test CR 908909 [15:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 
(T333332)', diff saved to https://phabricator.wikimedia.org/P47093 and previous config saved to /var/cache/conftool/dbconfig/20230418-154635-ladsgroup.json [15:49:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:30] (03CR) 10JHathaway: wmflib: updat ipresolv to work with puppet7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [15:54:41] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377 [15:54:45] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 [15:54:48] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [15:54:54] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... [15:55:10] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... 
[15:58:44] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10MSantos) [15:59:40] (03PS1) 10Lucas Werkmeister (WMDE): Add second tracking category for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909700 (https://phabricator.wikimedia.org/T334895) [16:00:04] jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1600) [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:08] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377 [16:00:12] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 [16:00:19] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [16:00:21] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... [16:00:37] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... 
[16:01:40] (03CR) 10JHathaway: core_modules: add core modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908326 (owner: 10Jbond) [16:01:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P47095 and previous config saved to /var/cache/conftool/dbconfig/20230418-160141-ladsgroup.json [16:02:42] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... [16:03:00] (03CR) 10Herron: [C: 03+2] kafka-logging: stop kafka service on kafka-logging1002 [puppet] - 10https://gerrit.wikimedia.org/r/907504 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [16:03:35] !log ariel@cumin1001 START - Cookbook sre.hosts.reimage for host htmldumper1001.eqiad.wmnet with OS bullseye [16:03:37] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377 [16:03:41] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 [16:04:01] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... [16:04:24] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... 
[16:04:50] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377 [16:07:34] (03CR) 10JHathaway: [C: 03+2] puppet::agent: Pass through the enable_puppet7 flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909326 (owner: 10Jbond) [16:08:06] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 [16:08:10] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: End of maintenance - T333377 [16:08:11] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [16:08:22] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... [16:08:34] !log depooling restbase-async from codfw [16:08:38] PROBLEM - IPMI Sensor Status on ganeti2019 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:43] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T3333... 
[16:09:01] (03PS2) 10Herron: kafka-logging: bring up kafka-logging1005 with node id 1005 [puppet] - 10https://gerrit.wikimedia.org/r/907505 (https://phabricator.wikimedia.org/T326419) [16:09:02] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool restbase-async in codfw: Depool from primary DC following network maintenance [16:09:03] !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors [16:09:07] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors [16:10:10] (03CR) 10Herron: [C: 03+2] kafka-logging: bring up kafka-logging1005 with node id 1005 [puppet] - 10https://gerrit.wikimedia.org/r/907505 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [16:12:48] PROBLEM - IPMI Sensor Status on parse2010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:14:06] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in codfw: Depool from primary DC following network maintenance [16:15:47] (03CR) 10Andrew Bogott: [C: 03+1] ToolsDB: remove replication filters [puppet] - 10https://gerrit.wikimedia.org/r/909695 (https://phabricator.wikimedia.org/T328691) (owner: 10FNegri) [16:16:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P47097 and previous config saved to /var/cache/conftool/dbconfig/20230418-161648-ladsgroup.json [16:18:36] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (10phaultfinder) [16:19:38] (03CR) 10Legoktm: "(Depending on urgency) I would suggest we move this into the Graph extension itself with a flag like $wgGraphDisabled like we implemented " [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [16:21:02] (03PS2) 10Lucas Werkmeister (WMDE): Add second tracking category for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909700 (https://phabricator.wikimedia.org/T334895) [16:22:41] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [16:23:55] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:24:07] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Kappakayala) Hi @Trizek-WMF, We understand that we should have notified to your team earlier. This is part of the s... [16:28:08] (03CR) 10Lucas Werkmeister (WMDE): Add placeholder content for Graph being disabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [16:28:41] (03CR) 10Legoktm: Add placeholder content for Graph being disabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [16:29:01] (03PS1) 10Btullis: Add the perccli utility to the new Ceph servers [puppet] - 10https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151) [16:29:37] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on htmldumper1001.eqiad.wmnet with reason: host reimage [16:30:20] (03CR) 10Lucas Werkmeister (WMDE): Add placeholder content for Graph being disabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [16:31:22] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40727/console" [puppet] - 
10https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [16:31:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T333332)', diff saved to https://phabricator.wikimedia.org/P47098 and previous config saved to /var/cache/conftool/dbconfig/20230418-163154-ladsgroup.json [16:31:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:32:00] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [16:32:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:32:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T333332)', diff saved to https://phabricator.wikimedia.org/P47099 and previous config saved to /var/cache/conftool/dbconfig/20230418-163217-ladsgroup.json [16:33:01] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on htmldumper1001.eqiad.wmnet with reason: host reimage [16:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:34:07] (03CR) 10Legoktm: Add placeholder content for Graph being disabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [16:34:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T333332)', diff saved to https://phabricator.wikimedia.org/P47100 and previous config saved to /var/cache/conftool/dbconfig/20230418-163432-ladsgroup.json [16:37:04] (03PS2) 10Jdrewniak: [beta cluster] Enable indicators on page 
load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909396 (https://phabricator.wikimedia.org/T333601) (owner: 10Jdlrobson) [16:38:29] (03PS3) 10Lucas Werkmeister (WMDE): Add placeholder content for Graph being disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) [16:38:31] (03PS3) 10Lucas Werkmeister (WMDE): Add second tracking category for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909700 (https://phabricator.wikimedia.org/T334895) [16:38:47] PROBLEM - IPMI Sensor Status on elastic2050 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:39:14] (03CR) 10Lucas Werkmeister (WMDE): Add placeholder content for Graph being disabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [16:39:16] (03PS1) 10Papaul: Add new backup node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/909708 (https://phabricator.wikimedia.org/T326965) [16:39:56] (03CR) 10Papaul: [C: 03+2] Add new backup node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/909708 (https://phabricator.wikimedia.org/T326965) (owner: 10Papaul) [16:41:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 3 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10Papaul) [16:43:09] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:44:41] !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [16:46:40] Quick question, I haven't deployed to beta in a while, should I schedule a deploy window for a beta config change? 
e.g: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/909396 [16:47:33] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [16:49:33] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [16:49:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P47101 and previous config saved to /var/cache/conftool/dbconfig/20230418-164939-ladsgroup.json [16:51:01] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) (owner: 10Majavah) [16:52:05] jan_drewniak: you can use a backport window or just do it outside a window when there's nothing else happening. `scap backport` will work correctly with those as long as they only edit the -labs.php files [16:52:57] taavi: ok thanks! [16:54:26] (03PS5) 10Majavah: openstack: puppet-enc: add foreign keys for hiera/role tables [puppet] - 10https://gerrit.wikimedia.org/r/906085 [16:54:28] (03PS5) 10Majavah: openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) [16:54:30] (03PS5) 10Majavah: openstack: admin_scripts: properly remove old projects from enc [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) [16:54:53] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10hnowlan) [16:55:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10Jhancock.wm) [16:55:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 3 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10Jhancock.wm) [16:57:06] (03PS1) 10Bartosz 
Dziewoński: Enable visual enhancements on pages using __NEWSECTIONLINK__ on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909710 (https://phabricator.wikimedia.org/T318596) [16:57:42] (03CR) 10Majavah: [C: 03+1] Add placeholder content for Graph being disabled (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [16:57:50] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host htmldumper1001.eqiad.wmnet with OS bullseye [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1700) [17:00:20] (03PS4) 10Majavah: Add second tracking category for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909700 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [17:01:02] (03CR) 10Majavah: [C: 03+1] Add second tracking category for Graph (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909700 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [17:01:23] PROBLEM - IPMI Sensor Status on cp2031 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:02:09] (03PS1) 10Majavah: Add temporary message for Graph being disabled [extensions/WikimediaMessages] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909639 (https://phabricator.wikimedia.org/T334895) [17:02:19] (03PS1) 10Majavah: Add temporary message for Graph being disabled [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/909640 (https://phabricator.wikimedia.org/T334895) [17:04:01] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - 
https://phabricator.wikimedia.org/T333377 (10cmooney) 05Open→03Resolved All works complete, no issues to report. [17:04:13] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10cmooney) [17:04:21] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10cmooney) [17:04:33] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) [17:04:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P47102 and previous config saved to /var/cache/conftool/dbconfig/20230418-170445-ladsgroup.json [17:05:57] (03CR) 10Andrew Bogott: [C: 03+2] O:wmcs::nfs: remove primary_backup::misc and related classes [puppet] - 10https://gerrit.wikimedia.org/r/906783 (https://phabricator.wikimedia.org/T301280) (owner: 10Majavah) [17:06:14] (03CR) 10Andrew Bogott: [C: 03+2] P:toolforge::checker: remove showmount check [puppet] - 10https://gerrit.wikimedia.org/r/907134 (owner: 10Majavah) [17:06:58] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: openstack: drop NAT exceptions for nfs-tools-project [puppet] - 10https://gerrit.wikimedia.org/r/907135 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [17:06:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:25] (03PS1) 10Hnowlan: rest-gateway: move timeout config to route-level [deployment-charts] - 10https://gerrit.wikimedia.org/r/909712 [17:07:30] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 
(10phaultfinder) [17:07:49] (03CR) 10Andrew Bogott: [C: 03+2] O:wmcs::nfs: remove primary_backup::tools and related classes [puppet] - 10https://gerrit.wikimedia.org/r/909612 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [17:08:17] (03CR) 10Andrew Bogott: [C: 03+2] labstore: delete backup related classes [puppet] - 10https://gerrit.wikimedia.org/r/909613 (owner: 10Majavah) [17:08:24] jouncebot: nowandnext [17:08:24] For the next 0 hour(s) and 51 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1700) [17:08:24] In 0 hour(s) and 51 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1800) [17:08:50] (03CR) 10Andrew Bogott: [C: 03+2] labstore: stop provisioning backup keys [puppet] - 10https://gerrit.wikimedia.org/r/909614 (owner: 10Majavah) [17:09:15] (03CR) 10Andrew Bogott: [C: 03+2] labstore: remove most monitoring classes [puppet] - 10https://gerrit.wikimedia.org/r/909615 (owner: 10Majavah) [17:11:39] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [17:14:51] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:14:59] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10VirginiaPoundstone) [17:16:07] (03PS1) 10Majavah: hieradata: remove leftover role hieradata [puppet] - 10https://gerrit.wikimedia.org/r/909714 [17:16:32] (03CR) 10FNegri: [C: 03+1] "LGTM, but I'm not familiar with how pt-heartbeat works." 
[puppet] - 10https://gerrit.wikimedia.org/r/909397 (https://phabricator.wikimedia.org/T334925) (owner: 10BryanDavis) [17:16:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:42] jouncebot: nowandnext [17:18:42] For the next 0 hour(s) and 41 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1700) [17:18:42] In 0 hour(s) and 41 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1800) [17:19:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909639 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [17:19:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/909640 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [17:19:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909641 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [17:19:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/909642 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [17:19:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T333332)', diff saved to https://phabricator.wikimedia.org/P47103 and previous config saved to /var/cache/conftool/dbconfig/20230418-171951-ladsgroup.json [17:19:54] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [17:19:58] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:20:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [17:20:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:20:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:20:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T333332)', diff saved to https://phabricator.wikimedia.org/P47104 and previous config saved to /var/cache/conftool/dbconfig/20230418-172032-ladsgroup.json [17:22:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T333332)', diff saved to https://phabricator.wikimedia.org/P47105 and previous config saved to /var/cache/conftool/dbconfig/20230418-172247-ladsgroup.json [17:26:47] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@3b8ab60]: (no justification provided) [17:26:59] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@3b8ab60]: (no justification provided) (duration: 00m 12s) [17:27:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [17:28:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [17:34:29] RECOVERY - Check systemd state on doc2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:56] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10colewhite) [17:37:08] (03Merged) 10jenkins-bot: Add temporary message for Graph being disabled 
[extensions/WikimediaMessages] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909639 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [17:37:10] (03Merged) 10jenkins-bot: Add temporary message for Graph being disabled [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/909640 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [17:37:12] (03Merged) 10jenkins-bot: Add temporary tracking category for Graph being disabled [extensions/WikimediaMessages] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909641 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [17:37:14] (03Merged) 10jenkins-bot: Add temporary tracking category for Graph being disabled [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/909642 (https://phabricator.wikimedia.org/T334895) (owner: 10Majavah) [17:37:40] !log taavi@deploy2002 Started scap: Backport for [[gerrit:909639|Add temporary message for Graph being disabled (T334895)]], [[gerrit:909640|Add temporary message for Graph being disabled (T334895)]], [[gerrit:909641|Add temporary tracking category for Graph being disabled (T334895)]], [[gerrit:909642|Add temporary tracking category for Graph being disabled (T334895)]] [17:37:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P47106 and previous config saved to /var/cache/conftool/dbconfig/20230418-173754-ladsgroup.json [17:43:00] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [17:46:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [17:46:31] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [17:47:20] !log jclark@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [17:47:24] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [17:48:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:49:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: cloudvirtlocal1001.eqiad.wmnet tends to get stuck on boot - https://phabricator.wikimedia.org/T334696 (10Jclark-ctr) @Papaul Updated netbox relocated to other switch [17:52:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:47] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10KFrancis) Hello all, please send Andrew McAllister's current email address to kfrancis@wikimedia.org and I will process the NDA. Thank you! 
[17:53:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P47107 and previous config saved to /var/cache/conftool/dbconfig/20230418-175301-ladsgroup.json [17:54:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:57:04] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10aaron) The lag spike seemed pretty bad: https:/... [17:59:04] !log taavi@deploy2002 taavi: Backport for [[gerrit:909639|Add temporary message for Graph being disabled (T334895)]], [[gerrit:909640|Add temporary message for Graph being disabled (T334895)]], [[gerrit:909641|Add temporary tracking category for Graph being disabled (T334895)]], [[gerrit:909642|Add temporary tracking category for Graph being disabled (T334895)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [18:00:05] jnuche and ^demon: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T1800). 
[18:01:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:16] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10karapayneWMDE) Email with email address sent [18:04:26] Hey, can I deploy a quick beta config change before the train starts? [18:04:45] this one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/909396 [18:05:45] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:06:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:24] jan_drewniak: I'm deploying some backports atm, I can do that after the current patches are done [18:06:43] (03CR) 10Majavah: [C: 03+2] [beta cluster] Enable indicators on page load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909396 (https://phabricator.wikimedia.org/T333601) (owner: 10Jdlrobson) [18:06:48] taavi: that would be great! 
[18:07:32] (03Merged) 10jenkins-bot: [beta cluster] Enable indicators on page load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909396 (https://phabricator.wikimedia.org/T333601) (owner: 10Jdlrobson) [18:07:43] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:08:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Jclark-ctr) Verified cables for both Servers below are the ports and cable ids @Papaul cloudswift1001 Rack,C8 U35. port {clouds... [18:08:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T333332)', diff saved to https://phabricator.wikimedia.org/P47108 and previous config saved to /var/cache/conftool/dbconfig/20230418-180807-ladsgroup.json [18:08:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [18:08:13] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:08:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [18:08:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T333332)', diff saved to https://phabricator.wikimedia.org/P47109 and previous config saved to /var/cache/conftool/dbconfig/20230418-180830-ladsgroup.json [18:10:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: cloudvirtlocal1001.eqiad.wmnet tends to get stuck on boot - https://phabricator.wikimedia.org/T334696 (10Jclark-ctr) @Andrew see above cable has been moved to cloudsw1-c8-eqiad [18:10:41] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:10:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T333332)', diff saved to https://phabricator.wikimedia.org/P47110 and previous config saved to /var/cache/conftool/dbconfig/20230418-181045-ladsgroup.json [18:13:28] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: make sli uptime use pre-existing metric [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) (owner: 10Ryan Kemper) [18:15:13] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:909639|Add temporary message for Graph being disabled (T334895)]], [[gerrit:909640|Add temporary message for Graph being disabled (T334895)]], [[gerrit:909641|Add temporary tracking category for Graph being disabled (T334895)]], [[gerrit:909642|Add temporary tracking category for Graph being disabled (T334895)]] (duration: 37m 33s) [18:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:56] (03CR) 10Majavah: [C: 03+2] Add placeholder content for Graph being disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [18:16:59] (03CR) 10Majavah: [C: 03+2] Add second tracking category for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909700 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [18:17:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jgreen) [18:17:52] (03Merged) 10jenkins-bot: Add placeholder content for Graph being disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909693 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [18:17:54] (03Merged) 
10jenkins-bot: Add second tracking category for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909700 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [18:18:27] !log taavi@deploy2002 Started scap: 909693 and 909700 [18:19:49] !log taavi@deploy2002 taavi: 909693 and 909700 synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [18:20:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:20:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jgreen) [18:21:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:21:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:23:17] 10SRE, 10Infrastructure-Foundations, 10Traffic: Updating Netbox for LVS hosts in eqiad lvs10(1[789]|20) - https://phabricator.wikimedia.org/T334884 (10cmooney) @ssingh thanks for the heads up. The renamed interfaces are definitely a bit of a headache here. Testing in netbox-next for lvs1020 I see two basi... 
[18:25:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:25:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P47111 and previous config saved to /var/cache/conftool/dbconfig/20230418-182551-ladsgroup.json [18:26:03] !log taavi@deploy2002 Finished scap: 909693 and 909700 (duration: 07m 36s) [18:28:27] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:28:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
[18:29:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:29:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:30:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:30:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Papaul) @Jclark-ctr thanks [18:30:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:30:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:23] (03CR) 10Andrew Bogott: [C: 03+2] openstack: puppet-enc: add foreign keys for hiera/role tables [puppet] - 10https://gerrit.wikimedia.org/r/906085 (owner: 10Majavah) [18:40:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P47112 and previous config saved to /var/cache/conftool/dbconfig/20230418-184058-ladsgroup.json [18:43:14] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces 
for host lvs1020 [18:43:21] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs1020 [18:44:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [18:46:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:47:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [18:48:09] (03CR) 10Majavah: "added the foreign keys to the live database" [puppet] - 10https://gerrit.wikimedia.org/r/906085 (owner: 10Majavah) [18:48:53] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for cloudswift100[1-2] - pt1979@cumin2002" [18:49:01] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10Ladsgroup) The excessive writes caused replag n... 
[18:49:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [18:49:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for cloudswift100[1-2] - pt1979@cumin2002" [18:49:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:50:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [18:50:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T333332)', diff saved to https://phabricator.wikimedia.org/P47113 and previous config saved to /var/cache/conftool/dbconfig/20230418-185010-ladsgroup.json [18:50:16] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:50:19] 10SRE, 10Infrastructure-Foundations: Netbox PuppetDB import script deletes cabel labels when interfaces are renamed - https://phabricator.wikimedia.org/T334987 (10cmooney) p:05Triage→03Low [18:51:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudswift1001.mgmt.eqiad.wmnet with reboot policy FORCED [18:51:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudswift1002.mgmt.eqiad.wmnet with reboot policy FORCED [18:52:38] 10SRE, 10Infrastructure-Foundations: Netbox PuppetDB import script deletes cabel labels when interfaces are renamed - https://phabricator.wikimedia.org/T334987 (10cmooney) [18:52:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T333332)', diff saved to https://phabricator.wikimedia.org/P47114 and previous config saved to 
/var/cache/conftool/dbconfig/20230418-185250-ladsgroup.json [18:56:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T333332)', diff saved to https://phabricator.wikimedia.org/P47115 and previous config saved to /var/cache/conftool/dbconfig/20230418-185604-ladsgroup.json [18:56:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [18:56:10] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:56:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [18:56:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T333332)', diff saved to https://phabricator.wikimedia.org/P47116 and previous config saved to /var/cache/conftool/dbconfig/20230418-185627-ladsgroup.json [18:58:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T333332)', diff saved to https://phabricator.wikimedia.org/P47117 and previous config saved to /var/cache/conftool/dbconfig/20230418-185842-ladsgroup.json [19:00:38] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10KFrancis) The NDA has been sent for signatures. I'll confirm when it's complete. 
[19:00:39] (03PS1) 10Eevans: cassandra: add de-init to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) [19:01:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:58] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [19:03:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:03:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye complete... [19:06:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:01] (03PS2) 10Cmelo: Remove multi organizers feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909401 (https://phabricator.wikimedia.org/T334088) [19:07:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Papaul) [19:07:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: cloudvirtlocal1001.eqiad.wmnet tends to get stuck on boot - https://phabricator.wikimedia.org/T334696 (10Papaul) 05Open→03Resolved first re-image didn't hang cancel it and while the installation was going and relaunched the re-image the second time n... 
[19:08:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P47118 and previous config saved to /var/cache/conftool/dbconfig/20230418-190756-ladsgroup.json [19:13:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P47119 and previous config saved to /var/cache/conftool/dbconfig/20230418-191348-ladsgroup.json [19:15:03] (03PS1) 10Andrea Denisse: prometheus: Add script to sync prometheus instamces in esams [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [19:15:27] (03PS1) 10Andrew Bogott: Revert "Move cloudvirtlocal1001 back to 'insetup'" [puppet] - 10https://gerrit.wikimedia.org/r/909739 (https://phabricator.wikimedia.org/T329863) [19:15:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:30] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Move cloudvirtlocal1001 back to 'insetup'" [puppet] - 10https://gerrit.wikimedia.org/r/909739 (https://phabricator.wikimedia.org/T329863) (owner: 10Andrew Bogott) [19:22:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P47120 and previous config saved to /var/cache/conftool/dbconfig/20230418-192302-ladsgroup.json [19:24:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1003.eqiad.wmnet with OS bullseye [19:24:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - 
https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1003.eqiad.wmnet... [19:27:02] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (10Dzahn) So the code lines where this errors out are: ` $ring_size = $facts['net_driver'][$interface]['driver'] ? {... [19:28:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Papaul) @Jclark-ctr there is no network cable connected to both nodes. [19:28:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P47121 and previous config saved to /var/cache/conftool/dbconfig/20230418-192855-ladsgroup.json [19:29:45] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (10Dzahn) code was added back in 2018 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/448506 [19:29:55] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:30:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:40] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [19:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T333332)', diff saved to https://phabricator.wikimedia.org/P47122 and previous config saved to /var/cache/conftool/dbconfig/20230418-193809-ladsgroup.json [19:38:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:38:15] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [19:38:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:38:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47123 and previous config saved to /var/cache/conftool/dbconfig/20230418-193832-ladsgroup.json [19:40:09] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:40:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47124 and previous config saved to /var/cache/conftool/dbconfig/20230418-194012-ladsgroup.json [19:40:29] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1003.eqiad.wmnet with reason: host reimage [19:43:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1003.eqiad.wmnet with reason: host reimage [19:44:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T333332)', diff saved to https://phabricator.wikimedia.org/P47125 and 
previous config saved to /var/cache/conftool/dbconfig/20230418-194401-ladsgroup.json [19:44:07] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [19:45:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P47126 and previous config saved to /var/cache/conftool/dbconfig/20230418-195518-ladsgroup.json [19:57:52] (03PS1) 10Bartosz Dziewoński: Simplify some more VisualEditor configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909747 [19:58:34] (03PS2) 10Bartosz Dziewoński: Enable visual enhancements on pages using __NEWSECTIONLINK__ on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909710 (https://phabricator.wikimedia.org/T318596) [19:58:43] (03PS3) 10Bartosz Dziewoński: Remove weird VisualEditor config hack from 2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905712 [19:58:52] (03PS2) 10Bartosz Dziewoński: Simplify some more VisualEditor configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909747 [19:59:44] (03PS3) 10Bartosz Dziewoński: Simplify some more VisualEditor configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909747 [20:00:07] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230418T2000). nyaa~ [20:00:07] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:27] hi [20:00:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:14] I can deploy :) [20:04:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909710 (https://phabricator.wikimedia.org/T318596) (owner: 10Bartosz Dziewoński) [20:05:04] (03Merged) 10jenkins-bot: Enable visual enhancements on pages using __NEWSECTIONLINK__ on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909710 (https://phabricator.wikimedia.org/T318596) (owner: 10Bartosz Dziewoński) [20:05:17] thanks [20:05:32] !log samtar@deploy2002 Started scap: Backport for [[gerrit:909710|Enable visual enhancements on pages using __NEWSECTIONLINK__ on dewiki (T318596)]] [20:05:38] T318596: Please enable Topic Containers (Usability/discussion activity) on all pages at the German Wikipedia - https://phabricator.wikimedia.org/T318596 [20:06:53] !log samtar@deploy2002 matmarex and samtar: Backport for [[gerrit:909710|Enable visual enhancements on pages using __NEWSECTIONLINK__ on dewiki (T318596)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:07:06] MatmaRex: ready on mwdebug [20:07:53] TheresNoTime: thank you, works as expected [20:08:05] syncing [20:08:51] my other two changes should be no-ops [20:09:30] Should be okay to roll together? [20:10:01] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) @ayounsi has been very helpful with reviewing this patch and it now has a tentative +1 (yay!) In terms of n... 
[20:10:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P47129 and previous config saved to /var/cache/conftool/dbconfig/20230418-201024-ladsgroup.json [20:10:31] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905712 (owner: 10Bartosz Dziewoński) [20:10:56] (03PS1) 10Jelto: gitlab: add script to create fs and raid for backup partition [puppet] - 10https://gerrit.wikimedia.org/r/909749 (https://phabricator.wikimedia.org/T330172) [20:11:19] (03Merged) 10jenkins-bot: Remove weird VisualEditor config hack from 2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905712 (owner: 10Bartosz Dziewoński) [20:11:43] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:12:51] TheresNoTime: yes. sorry, i didn't see the message [20:13:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1003.eqiad.wmnet with OS bullseye [20:13:02] TheresNoTime: i'd like to try them on mwdebug though [20:13:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1003.eqiad.wmnet with OS bullseye complete... 
[20:13:22] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:909710|Enable visual enhancements on pages using __NEWSECTIONLINK__ on dewiki (T318596)]] (duration: 07m 49s) [20:13:26] T318596: Please enable Topic Containers (Usability/discussion activity) on all pages at the German Wikipedia - https://phabricator.wikimedia.org/T318596 [20:13:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909747 (owner: 10Bartosz Dziewoński) [20:14:18] (03Merged) 10jenkins-bot: Simplify some more VisualEditor configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909747 (owner: 10Bartosz Dziewoński) [20:14:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:14:43] !log samtar@deploy2002 Started scap: Backport for [[gerrit:905712|Remove weird VisualEditor config hack from 2015]], [[gerrit:909747|Simplify some more VisualEditor configuration]] [20:16:03] !log samtar@deploy2002 matmarex and samtar: Backport for [[gerrit:905712|Remove weird VisualEditor config hack from 2015]], [[gerrit:909747|Simplify some more VisualEditor configuration]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:16:06] MatmaRex: both live on mwdebug [20:16:37] looking [20:19:25] TheresNoTime: all good [20:19:31] syncing :) [20:19:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:21:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: 
CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:04] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:25:15] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:905712|Remove weird VisualEditor config hack from 2015]], [[gerrit:909747|Simplify some more VisualEditor configuration]] (duration: 10m 32s) [20:25:23] all live :) [20:25:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47130 and previous config saved to /var/cache/conftool/dbconfig/20230418-202530-ladsgroup.json [20:25:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [20:25:36] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:25:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [20:25:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47131 and previous config saved to /var/cache/conftool/dbconfig/20230418-202554-ladsgroup.json [20:26:12] thanks TheresNoTime! [20:26:20] ^^ [20:27:47] (03CR) 10Bking: [C: 03+2] "This change is ready for review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/902502 (https://phabricator.wikimedia.org/T331303) (owner: 10Bking) [20:28:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47132 and previous config saved to /var/cache/conftool/dbconfig/20230418-202833-ladsgroup.json [20:29:56] (03Merged) 10jenkins-bot: elasticsearch: Add node ban logic [cookbooks] - 10https://gerrit.wikimedia.org/r/902502 (https://phabricator.wikimedia.org/T331303) (owner: 10Bking) [20:30:04] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:30:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:16] !log close UTC late backport window [20:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:37:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:29] (03CR) 10JHathaway: "Thanks for putting this together, I got it running locally some hackery of the pontoon bootstrap script. 
I needed to run puppetserver ca s" [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [20:43:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P47133 and previous config saved to /var/cache/conftool/dbconfig/20230418-204339-ladsgroup.json [20:45:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:56] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10Aklapper) [20:50:49] 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10Aklapper) @AndrewTavis_WMDE: Please update any WMDE templates not to use custom stuff but only link to https://phabricator.wikimedia.org/project/profile/1564/ - thanks! 
[20:51:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:18] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [20:54:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [20:58:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P47134 and previous config saved to /var/cache/conftool/dbconfig/20230418-205848-ladsgroup.json [21:01:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:45] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:05:45] (03PS1) 10JHathaway: replace puppet::config with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756 [21:06:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:10] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [21:11:13] (03PS2) 10Ladsgroup: lists: Bump the number worker processes to 4 [puppet] - 10https://gerrit.wikimedia.org/r/908896 (owner: 10JHathaway) [21:11:17] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] lists: Bump the number worker processes to 4 [puppet] - 10https://gerrit.wikimedia.org/r/908896 (owner: 10JHathaway) [21:12:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [21:12:18] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 
(10phaultfinder) [21:13:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47136 and previous config saved to /var/cache/conftool/dbconfig/20230418-211354-ladsgroup.json [21:13:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [21:14:01] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [21:14:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [21:14:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [21:15:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [21:15:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:15:18] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [21:15:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:15:29] (03CR) 10Ladsgroup: P:lists:monitoring: Raise process count for uwsgi (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909247 (owner: 10Clément Goubert) [21:15:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T333332)', diff saved to https://phabricator.wikimedia.org/P47137 and previous config saved to /var/cache/conftool/dbconfig/20230418-211529-ladsgroup.json [21:16:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:09] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T333332)', diff saved to https://phabricator.wikimedia.org/P47138 and previous config saved to /var/cache/conftool/dbconfig/20230418-211808-ladsgroup.json [21:18:46] (03PS2) 10JHathaway: replace puppet::config with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756 [21:24:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [21:33:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P47139 and previous config saved to /var/cache/conftool/dbconfig/20230418-213314-ladsgroup.json [21:33:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [21:35:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:08] (03CR) 10Andrea Denisse: "PCC results: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40736/" [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [21:36:39] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:38:22] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40740/console" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [21:42:57] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40742/console" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [21:43:15] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [21:45:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:33] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40743/console" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [21:46:07] (03PS3) 10JHathaway: replace puppet::config with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756 [21:46:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [21:47:13] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:48:00] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40744/console" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [21:48:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P47140 and previous config saved to /var/cache/conftool/dbconfig/20230418-214820-ladsgroup.json [21:48:58] (03CR) 10JHathaway: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [21:49:51] (03CR) 10JHathaway: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/909756/40744/" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [21:51:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:53:08] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10aaron) >>! In T334023#8791186, @Ladsgroup wrote... 
[21:57:48] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:01:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T333332)', diff saved to https://phabricator.wikimedia.org/P47141 and previous config saved to /var/cache/conftool/dbconfig/20230418-220327-ladsgroup.json [22:03:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1183.eqiad.wmnet with reason: Maintenance [22:03:34] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [22:03:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1183.eqiad.wmnet with reason: Maintenance [22:03:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1183 (T333332)', diff saved to https://phabricator.wikimedia.org/P47142 and previous config saved to /var/cache/conftool/dbconfig/20230418-220350-ladsgroup.json [22:05:46] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Eevans) I can be the point of contact for [[ https://gerrit.wikimedia.org/g/mediawiki/services/kask | mediawiki/services/kask ]], and am ready when y... 
[22:06:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T333332)', diff saved to https://phabricator.wikimedia.org/P47143 and previous config saved to /var/cache/conftool/dbconfig/20230418-220629-ladsgroup.json [22:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:21:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P47144 and previous config saved to /var/cache/conftool/dbconfig/20230418-222135-ladsgroup.json [22:21:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:31:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:53] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:32:18] PROBLEM - nova-compute proc minimum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:32:51] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:33:56] SRE, Commons, WMF-General-or-Unknown: some file thumbs fail to purge on upload of a new version - https://phabricator.wikimedia.org/T35672 (Teslaton) Declined→Open This still happens. I've uploaded a new version of https://commons.wikimedia.org/wiki/File:EPSON_PX-8_-_1.jpg with cropped image... [22:34:45] PROBLEM - nova-compute proc minimum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:34:53] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:35:05] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:35:21] I'm working on ^ [22:35:29] RECOVERY - nova-compute proc minimum on cloudvirt1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:35:49] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:36:03] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:36:17] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:36:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P47145 and previous config saved to /var/cache/conftool/dbconfig/20230418-223642-ladsgroup.json [22:36:45] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:40:07] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:41:21] RECOVERY - nova-compute proc minimum on cloudvirt1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:41:28] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:46:35] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:47:22] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:47:48] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:50:54] (CR) Dzahn: [C: +1] "lgtm! nitpicks: I wouldn't call this a "script to sync", you have auto_sync off after all, so nothing happens automatically and it's just " [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [22:51:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T333332)', diff saved to https://phabricator.wikimedia.org/P47146 and previous config saved to /var/cache/conftool/dbconfig/20230418-225148-ladsgroup.json [22:51:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [22:52:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [22:52:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T333332)', diff saved to https://phabricator.wikimedia.org/P47147 and previous config saved to /var/cache/conftool/dbconfig/20230418-225211-ladsgroup.json [22:52:16] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [22:54:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T333332)', diff saved to https://phabricator.wikimedia.org/P47148 and previous config saved to /var/cache/conftool/dbconfig/20230418-225449-ladsgroup.json
[22:56:15] (CR) Cwhite: [C: +2] logstash: decouple template_version and ecs.version [puppet] - https://gerrit.wikimedia.org/r/906701 (https://phabricator.wikimedia.org/T292585) (owner: Cwhite) [23:00:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:51] (CR) Cwhite: [C: +1] "haven't tested it, but looks reasonable" [puppet] - https://gerrit.wikimedia.org/r/907991 (owner: Jbond) [23:02:57] (CR) Cwhite: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [23:03:07] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:07] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:29] hello hello [23:03:45] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:05:24] o/ [23:08:07] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:07] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:45] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:09:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P47149 and previous config saved to /var/cache/conftool/dbconfig/20230418-230956-ladsgroup.json [23:21:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:25:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P47150 and previous config saved to /var/cache/conftool/dbconfig/20230418-232502-ladsgroup.json [23:30:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:40:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T333332)', diff saved to https://phabricator.wikimedia.org/P47151 and previous config saved to 
/var/cache/conftool/dbconfig/20230418-234008-ladsgroup.json [23:40:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [23:40:15] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [23:40:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [23:40:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T333332)', diff saved to https://phabricator.wikimedia.org/P47152 and previous config saved to /var/cache/conftool/dbconfig/20230418-234032-ladsgroup.json [23:43:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED [23:44:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T333332)', diff saved to https://phabricator.wikimedia.org/P47153 and previous config saved to /var/cache/conftool/dbconfig/20230418-234410-ladsgroup.json [23:45:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:57] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED [23:50:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED [23:53:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED [23:58:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED [23:59:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P47154 and previous config saved to 
/var/cache/conftool/dbconfig/20230418-235916-ladsgroup.json [23:59:41] RECOVERY - PHP opcache health on mw2428 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health